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ABSTRACT 



A multiplier configured to perform multiplication of both 
scalar floating point values (XxY) and packed floating point 
values (i.e., X 1 xY 1 and X 2 xY^). In addition, the multiplier 
may be configured to calculate XxY-Z. The multiplier 
comprises selection logic for selecting source operands, a 
partial product generator, an adder tree, and two or more 
adders configured to sum the results from the adder tree to 
achieve a final result. The multiplier may also be configured 
to perform iterative multiplication operations to implement 
such arithmetical operations such as division and square 
root. The multiplier may be configured to generate two 
versions of the final result, one assuming there is an 
overflow, and another assuming there is not an overflow. A 
computer system and method for performing multiplication 
are also disclosed. 

20 Claims, 11 Drawing Sheets 
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SHARED FP AND SIMD 3D MULTIPLIER 

RELATED APPLICATIONS 

This application is a continuation-in-part of the following: 
(i) U.S. application Ser. No. 09/014,454, filed on Jan. 1, 
1998, now U.S. Pat. No. : 6,144,980; which is hereby 
incorporated by reference in its entirety; (ii) U.S. application 
Ser. No. 09/049,752, filed on Mar. 27, 1998, now U.S. Pat. 
No. : 6,269,384 which is hereby incorporated by reference 
in its entirety; and (iii) U.S. application Ser. No. 09/134,171, 
filed on Aug. 14, 1998, now U.S. Pat. No. : 6,223,198 which 
is hereby incorporated by reference in its entirety. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates generally to the field of micropro- 
cessors and, more particularly, to multipliers that are usable 
to perform floating point calculations in microprocessors. 

2. Description of the Related Art 
Microprocessors are typically designed with a number of 

"execution units" that are each optimized to perform a 
particular set of functions or instructions. For example, one 
or more execution units within a microprocessor may be 
optimized to perform memory accesses, i.e., load and store 
operations. Other execution units may be optimized to 
perform general arithmetic and logic functions, e.g., shifts 
and compares. Many microprocessors also have specialized 
execution units configured to perform more complex 
floating-point arithmetic operations including multiplication 
and reciprocal operations. These specialized execution units 
typically comprise hardware that is optimized to perform 
one or more floating-point arithmetic functions. 

Most microprocessors must support multiple data types. 
For example, x86 compatible microprocessors must execute 
instructions that are defined to operate upon an integer data 
type and instructions that are defined to operate upon 
floating-point data types. Floating-point data can represent 
numbers within a much larger range than integer data. For 
example, a 32-bit signed integer can represent the integers 
between -23 31 and 2 31 -1 (using two's complement format). 
In contrast, a 32-bit ("single precision") floating-point num- 
ber as defined by the Institute of Electrical and Electronic 
Engineers (IEEE) Standard 754 has a range (in normalized 
format) from 2" 12e to 2 327 x(2-2" 23 ) in both positive and 
negative numbers. 

Turning now to FIG. 1A, an exemplary format for an 8-bit 
integer 100 is shown. As illustrated in the figure, negative 
integers are represented using the two's complement format 
104. To negate an integer, all bits are inverted to obtain the 
one's complement format 102. A constant of one is then 
added to the least significant bit (LSB). 

Turning now to FIG. IB, an exemplary format for a 32-bit 
(single precision) floating-point number 106 is shown. A 
floating-point number is represented by a significand, an 
exponent and a sign bit. The base for the floating-point 
number is raised to the power of the exponent and multiplied 
by the significand to arrive at the number represented. In 
microprocessors, base 2 is typically used. The significand 
comprises a number of bits used to represent the most 
significant digits of the number. Typically, the significand 
comprises one bit to the left of the radix point and the 
remaining bits to the right of the radix point. In order to save 
space, the bit to the left of the radix point, known as the 
integer bit, is not explicitly stored. Instead, it is implied in 
the format of the number. Additional information regarding 
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floating-point numbers and operations performed thereon 
may be obtained in IEEE Standard 754 (IEEE-754)., Unlike 
the integer representation, two's complement format is not 
typically used in the floating-point representation. Instead, 
5 sign and magnitude form are used. Thus, only the sign bit is 
changed when converting from positive value 106 to nega- 
tive value 108. 

In the x86 architecture, the floating point format supports 
a number of special cases. These special cases may appear 

10 in one or more operands or one or more results for a 
particular instruction. FIG. 2 illustrates the sign, exponent, 
and significand formats of special and exceptional cases that 
are included in the IEEE-754 floating-point standard. The 
special and exceptional cases shown in FIG. 2 include a zero 

15 value, an infinity value, NaN (not-a-number) values, and a 
denormal value. An V in FIG. 2 represents a value that can 
be either one or zero. NaN values may include a QNaN 
(quiet not-a-number) value and a SNaN (signaling not-a- 
number) value as defined by a particular architecture. The 

20 numbers depicted in FIG. 2 are shown in base 2 format as 
indicated by the subscript 2 following each number. As 
shown, a number with all zeros in its exponent and signifi- 
cand represents a zero value in the IEEE-754 floating-point 
standard. A number with all ones in its exponent, a one in the 

25 most significant bit of its significand, and zeros in the 
remaining bits of its significant represents an infinity value. 
The remaining special and exceptional cases are depicted 
similarly. 

Given the substantial differences in floating point and 

30 integer formats, microprocessor designers have typically 
used two sets of execution units, i.e., one set optimized to 
perform arithmetic on integer instructions and one set opti- 
mized to perform arithmetic on floating point instructions. 
Unfortunately, this approach has some potential drawbacks. 
Die space on a microprocessor is a relatively scarce 
commodity, and the die space required to implement com- 
plex execution units such as multipliers is significant. Thus, 
duplicating multipliers for both integer and floating point 
formats consumes precious real estate that could be used to 

40 implement additional functionality. 

The recent addition of three-dimensional graphics instruc- 
tions (e.g., AMD's 3DNow™ instructions) to the standard 
x86 instruction has further complicated matters by increas- 

45 ing the performance demands on the microprocessor's arith- 
matic execution units (and multiplier execution units in 
particular). As those skilled in the art will appreciate, 
3DNow™ instructions are so-called SIME (single 
instruction, multiple data) instructions that have operands 

5Q that include multiple floating point values packed together. 
As a result, a method for executing arithmetic instructions 
with different instruction and data formats is needed. In 
particular, a method for executing multiply instructions 
having different data types without dramatically increasing 

55 the die space used is desired. 

SUMMARY 

The problems outlined above are in large part solved by 
the multiplier and method for performing multiplication 

60 described herein. In one embodiment, a single multiplier 
may be configured to perform scalar floating point multi- 
plication and packed floating point multiplication, e.g., 
single-instruction multiple-data (SIMD) multiplication. 
The multiplier may include selection logic that is config- 

65 ured to select a multiplier operand and a multiplicand 
operand from among a plurality of different potential 
sources, wherein the potential sources may include one or 
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more of the following: a floating point operand, a packed 
floating point operand, or the result of a previous iterative 
multiplication instruction. For example, the multiplier may 
perform scalar 90-bit format floating point multiplication 
(i.e., single, double, extended, or internal precision), packed 
90-bit format (i.e., 2 packed 32-bit floating point values); or 
the results of a previous multiplication instruction (i.e., via 
an internal bypass mechanism for instructions such as itera- 
tive multiplication operations). For example, some instruc- 
tions such as reciprocal instructions (used to perform 
division) and square root instructions may be implemented 
using iterative algorithms that perform a number of different 
multiplication operations before obtaining the correct result. 
In these situations it may particularly advantageous for a 
multiplier to have the capability of internally bypassing the 
results of a previous multiplication operation directly to the 
selection logic for another multiplication operation. 

In some embodiments, square root and divide instructions 
may be translated into a series of special multiplication 
opcodes, wherein each multiplication opcode is configured 
to perform a particular function in addition to standard 
multiplication. For example, one of the iterative special 
multiplication instructions may be a "BACKMUL" instruc- 
tion that accepts three source operands, A, B, and Q, and 
calculates the value B-Q-A. 

In some embodiments, the multiplier may be further 
configured to detect divide operations having a multiplier 
that is exactly a power of two. The multiplier may be 
configured to execute these divide operations without pro- 
ceeding through the entire iterative division process. Instead, 
the multiplier may be configured to shift the exponent and 
round the significand to the appropriate precision without 
significant additional hardware. 

In some embodiments, the multiplier may be further 
configured to perform independent multiplication instruc- 
tions during the idle clock cycles that may occur during 
complex iterative instructions such as square root and non- 
power of two divides. 

BRIEF DESCRIPTION OF THE DRAWINGS 
Other objects and advantages of the invention will 
become apparent upon reading the following detailed 
description and upon reference to the accompanying draw- 
ings in which: 

FIG. 1A is a diagram of an exemplary integer data format 
using two's complement representation. 

FIG. IB is a diagram of an exemplary floating-point data 
format. 

FIG. 2 is a table listing special cases for a defined 
floating-point data format. 

FIG. 3 is a block diagram of one embodiment of an 
exemplary microprocessor. 

FIG. 4 is a block diagram of one embodiment of a 
floating-point unit from the exemplary microprocessor of 
FIG. 3. 

FIG. 5 is a block diagram depicting portions of one 
embodiment of the multiplier from FIG. 4. 

FIG. 6 is a table illustrating rounding constants for one 
embodiment of the multiplier from FIG. 5. 

FIG. 7 is a block diagram illustrating generally one 
embodiment of the binary adder tree from the multiplier of 
FIG. 5. 

FIG. 8 is a block diagram illustrating more details of one 
embodiment of the multiplier from FIG. 7. 

FIG. 9 is a block diagram illustrating more details of one 
embodiment of the selection logic of the multiplier from 
FIG. 5. 
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FIG. 10 is a table of rounding actions for one embodiment 
of the multiplier from FIG. 5. 

FIG. U is a block diagram of one embodiment of a 
computer system configured to utilize the microprocessor of 
FIG. 1. 

While the invention is susceptible to various modifica- 
tions and alternative forms, specific embodiments thereof 
are shown by way of example in the drawings and will 
herein be described in detail. It should be understood, 
however, that the drawings and detailed description thereto 
are not intended to limit the invention to the particular form 
disclosed, but on the contrary, Lhe intention is to cover all 
modifications, equivalents, and alternatives falling within 
the spirit and scope of the present invention as defined by the 
appended claims. 

DETAILED DESCRIPTION OF SEVERAL 
EMBODIMENTS 

Turning now to FIG. 3, a block diagram of one embodi- 
ment of a microprocessor 10 is shown. Microprocessor 10 
includes a prefetch/predecode unit 12, a branch prediction 
unit 14, an instruction cache 16, an instruction alignment 
unit 18, a plurality of decode units 20A-20C, a plurality of 
reservation stations 22A-22C, a plurality of functional units 
24A-24C, a load/store unit 26, a data cache 28, a register file 
30, a reorder buffer 32, an MROM unit 34, and a floating- 
point unit (FPU) 36, which in turn comprises multiplier 50. 
Note that elements referred to herein with a particular 
reference number followed by a letter may be collectively 
referred to by the reference number alone. For example, 
decode units 20A-20C may be collectively referred to as 
decode units 20. 

Prefetch/predecode unit 12 is coupled to receive instruc- 
tions from a main memory subsystem (not shown), and is 
further coupled to instruction cache 16 and branch predic- 
tion unit 14. Similarly, branch prediction unit 14 is coupled 
to instruction cache 16. Still further, branch prediction unit 
14 is coupled to decode units 20 and functional units 24. 
Instruction cache 16 is further coupled to MROM unit 34 
and instruction alignment unit 18. Instruction alignment unit 
18, which comprises an early decode unit (EDU) 44, is in 
turn coupled to decode units 20. Each decode unit 20A-20C 
is coupled to load/store unit 26 and to respective reservation 
stations 22A-22C. Reservation stations 22A-22C are fur- 
ther coupled to respective functional units 24A-24C. 
Additionally, decode units 20 and reservation stations 22 are 
coupled to register file 30 and reorder buffer 32. Functional 
units 24 are coupled to load/store unit 26, register file 30, 
and reorder buffer 32 as well. Data cache 28 is coupled to 
load/store unit 26 and to the main memory subsystem. 
MROM unit 34, which also comprises an early decode unit 
(EDU) 42 is coupled to decode units 20 and FPU 36. Finally, 
FPU 36 is coupled to load/store unit 26 and reorder buffer 
32. 

Instruction cache 16 is a high speed cache memory 
provided to store instructions. Instructions are fetched from 
instruction cache 16 and dispatched to decode units 20. In 
one embodiment, instruction cache 16 is configured to store 
up to 64 kilobytes of instructions in a 2- way set associative 
structure having 64-byte lines (a byte comprises 8 binary 
bits). It is noted that instruction cache 16 may be imple- 
mented in a fully- associative, set-associative, or direct- 
mapped configuration. 

Instructions are fetched from main memory and stored 
into instruction cache 16 by prefetch/predecode unit 12. 
Instructions may be prefetched prior to the request thereof in 
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accordance with a prefetch scheme. A variety of prefetch 
schemes may be employed by prefetch/predecode unit 12. 
As prefetch/predecode unit 12 transfers instructions from 
main memory to instruction cache 16, prefetch/predecode 
unit 12 generates three predecode bits for each byte of the 5 
instructions: a start bit, an end bit, and a functional bit. The 
predecode bits form tags indicative of the boundaries of each 
instruction. The predecode tags may also convey additional 
information such as whether a given instruction may be 
decoded directly by decode units 20 or whether the instruc- ao 
tion is executed by invoking a microcode procedure con- 
trolled by MROM unit 34, as will be described in greater 
detail below. Still further, prefetch/predecode unit 12 may be 
configured to detect branch instructions and to store branch 
prediction information corresponding to the branch instruc- as 
tions into branch prediction unit 14. 

One encoding of the predecode tags for an embodiment of 
microprocessor 10 employing a variable byte length instruc- 
tion set will next be described. A variable byte length 
instruction set is an instruction set in which different instruc- 20 
tions may occupy differing numbers of bytes. An exemplary 
variable byte length instruction set employed by one 
embodiment of microprocessor 10 is the x86 instruction set. 

In the exemplary encoding, if a given byte is the first byte 
of an instruction, the start bit for that byte is set. If the byte 25 
is the last byte of an instruction, the end bit for that byte is 
set. Instructions which may be directly decoded by decode 
units 20 are referred to as "fast path" instructions. The 
remaining x86 instructions are referred to as MROM 
instructions, according to one embodiment. For fast path 30 
instructions, the functional bit is set for each prefix byte 
included in the instruction, and cleared for other bytes. 
Alternatively, for MROM instructions, the functional bit is 
cleared for each prefix byte and set for other bytes. The type 
of instruction may be determined by examining the func- 35 
tional bit corresponding to the end byte. If that functional bit 
is clear, the instruction is a fast path instruction. Conversely, 
if that functional bit is set, the instruction is an MROM 
instruction. The opcode of an instruction may thereby be 
located within an instruction which may be directly decoded 40 
by decode units 20 as the byte associated with the first clear 
functional bit in the instruction. For example, a fast path 
instruction including two prefix bytes, a Mod RIM byte, and 
an immediate byte would have start, end, and functional bits 
as follows: 45 



Start bits a 0000 

End bits 0000a 
Functional bits aiOOO 



According to one particular embodiment, early identify- 
ing that an instruction includes a scale -index-base (SIB) byte 
is advantageous for MROM unit 34. For such an 55 
embodiment, if an instruction includes at least two bytes 
after the opcode byte, the functional bit for the Mod R/M 
byte indicates the presence of an SIB byte. If the functional 
bit for the Mod R/M byte is set, then an SIB byte is present. 
Alternatively, if the functional bit for the Mod R/M byte is eo 
clear, then an SIB byte is not present. 

MROM instructions are instructions which are deter- 
mined to be too complex for decode by decode units 20. 
MROM instructions are executed by invoking MROM unit 
34. More specifically, when an MROM instruction is 65 
encountered, MROM unit 34 parses and issues the instruc- 
tion into a subset of defined fast path instructions to effec- 
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tuate the desired operation. MROM unit 34 dispatches the 
subset of fast path instructions to decode units 20. 

Microprocessor 10 employs branch prediction in order to 
speculatively fetch instructions subsequent to conditional 
branch instructions. Branch prediction unit 14 is included to 
perform branch prediction operations. In one embodiment, 
up to two branch target addresses are stored with respect to 
each 16 byte portion of each cache line in instruction cache 
16. Prefetch/predecode unit 12 determines initial branch 
targets when a particular line is predecoded. Subsequent 
updates to the branch targets corresponding to a cache line 
may occur due to the execution of instructions within the 
cache line. Instruction cache 16 provides an indication of the 
instruction address being fetched, so that branch prediction 
unit 14 may determine which branch target addresses to 
select for forming a branch prediction. Decode units 20 and 
functional units 24 provide update information to branch 
prediction unit 14. Because branch prediction unit 14 stores 
two targets per 16 byte portion of the cache line, some 
branch instructions within the line may not be stored in 
branch prediction unit 14. Decode units 20 detect branch 
instructions which were not predicted by branch prediction 
unit 14. Functional units 24 execute the branch instructions 
and determine if the predicted branch direction is incorrect. 
The branch direction may be "taken", in which subsequent 
instructions are fetched from the target address of the branch 
instruction. Conversely, the branch direction may be "not 
taken*', in which case subsequent instructions are fetched 
from memory locations consecutive to the branch instruc- 
tion. When a mispredicted branch instruction is detected, 
instructions subsequent to the mispredicted branch are dis- 
carded from the various units of microprocessor 10. A 
variety of suitable branch prediction algorithms may be 
employed by branch prediction unit 14. 

Instructions fetched from instruction cache 16 are con- 
veyed to instruction alignment unit 18 and MROM unit 34. 
As instructions are fetched from instruction cache 16, the 
corresponding predecode data is scanned to provide infor- 
mation to instruction alignment unit 18 and MROM unit 34 
regarding the instructions being fetched. Instruction align- 
ment unit 18 utilizes the scanning data to align an instruction 
to each of multiplexers 46A-C. In one embodiment, instruc- 
tion alignment unit 18 aligns instructions from three sets of 
eight instruction bytes to decode units 20. Similarly, MROM 
unit 34 is configured to output up to three aligned instruc- 
tions to multiplexers 46A-C. Note, both instruction align- 
ment unit 18 and MROM unit 34 may each have an early 
decode unit (EDC) 42 and 44. These units may perform the 
first steps of the decoding process, e.g., identifying the 
operand specifiers for each instruction. 

Each multiplexer 46A-C is configured to receive a par- 
tially decoded instruction (and corresponding decode and 
predecode information) from instruction alignment unit 18 
and MROM unit 34. Each multiplexer 46A-C is configured 
to select up to one instruction from either instruction align- 
ment unit 18 or MROM unit 34 during each clock cycle. The 
selected instructions are routed to decode units 20A-C 
(integer instructions), and FPU 36 (x86 floating-point, 
MMX, and 3DX instructions). In one embodiment of micro- 
processor 10, up to three floating-point instructions per 
clock cycle may be conveyed to floating-point unit 36. As 
note above, the instructions may come from MROM unit 34 
(microcode instructions) or instruction alignment unit 18 
(fast path instructions). Decode units 20 are configured to 
complete decoding instructions received from multiplexers 
46A-C. Register operand information is detected and routed 
to register file 30 and reorder buffer 32. Additionally, if the 
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instructions require one or more memory operations to be determined to modify the contents of one of the real regis- 

performed, decode units 20 dispatch the memory operations ters. Therefore, at various points during execution of a 

to load/store unit 26. Each instruction is decoded into a set particular program, reorder buffer 32 may have one or more 

of control values for functional units 24, and these control locations which contain the speculatively executed contents 

values are dispatched to reservation stations 22 along with 5 of a given register. If following decode of a given instruction 

operand address information and displacement or immediate it is determined that reorder buffer 32 has a previous location 

data which may be included with the instruction. or locations assigned to a register used as an operand in the 

Microprocessor 10 supports out of order execution, and given instruction, the reorder buffer 32 forwards to the 

thus employs reorder buffer 32 to keep track of the original corresponding reservation station either: 1) the value in the 

program sequence for register read and write operations, to 10 most recently assigned location, or 2) a tag for the most 

implement register renaming, to allow for speculative recently assigned location if the value has not yet been 

instruction execution and branch misprediction recovery, produced by the functional unit that will eventually execute 

and to facilitate precise exceptions. A temporary storage the previous instruction. If reorder buffer 32 has a location 

location within reorder buffer 32 is reserved upon decode of reserved for a given register, the operand value (or reorder 

an instruction that involves the update of a register to 15 buffer tag) is provided from reorder buffer 32 rather than 

thereby store speculative register states. If a branch predic- from register file 30. If there is no location reserved for a 

tion is incorrect, the results of speculatively-executed required register in reorder buffer 32, the value is taken 

instructions along the mispredicted path can be invalidated directly from register file 30. If the operand corresponds to 

in the buffer before they are written to register file 30. a memory location, the operand value is provided to the 

Similarly, if a particular instruction causes an exception, 2 o reservation station through load/store unit 26. 

instructions subsequent to the particular instruction may be In one particular embodiment, reorder buffer 32 is con- 

discarded. In this manner, exceptions are "precise" (i.e., figured to store and manipulate concurrently decoded 

instructions subsequent to the particular instruction causing instructions as a unit. This configuration will be referred to 

the exception are not retired prior to the exception). Stated herein as "line-oriented". By manipulating several instruc- 

another way, while some instructions following the 2 5 tions together, the hardware employed within reorder buffer 

exception-causing instruction may have been executed 32 may be simplified. For example, a line-oriented reorder 

before the exception, their results have not been committed buffer included in the present embodiment allocates storage 

to the microprocessor's architectural state. It is noted that a sufficient for instruction information pertaining to three 

particular instruction is speculatively executed if it is instructions (one from each decode unit 20) whenever one or 

executed prior to instructions which precede the particular 30 more instructions are dispatched by decode units 20. By 

instruction in program order. Preceding instructions may be contrast, a variable amount of storage is allocated in con- 

a branch instruction or an exception-causing instruction, in ventional reorder buffers, dependent upon the number of 

which case the speculative results may be discarded by instructions actually dispatched. A comparatively larger 

reorder buffer 32. number of logic gates may be required to allocate the 

The instruction control values and immediate or displace- 35 variable amount of storage. When each of the concurrently 

ment data provided at the outputs of decode units 20 are decoded instructions has executed, the instruction results are 

routed directly to respective reservation stations 22. In one stored into register file 30 simultaneously. The storage is 

embodiment, each reservation station 22 is capable of hold- then free for allocation to another set of concurrently 

ing instruction information (i.e., instruction control values as decoded instructions. Additionally, the amount of control 

well as operand values, operand tags and/or immediate data) 40 logic circuitry employed per instruction is reduced because 

for up to six pending instructions awaiting issue to the the control logic is amortized over several concurrently 

corresponding functional unit. It is noted that for the decoded instructions. A reorder buffer tag identifying a 

embodiment of FIG. 3, each reservation station 22 is asso- particular instruction may be divided into two fields: a line 

ciated with a dedicated functional unit 24, Accordingly, tag and an offset tag. The line tag identifies the set of 

three dedicated "issue positions" are formed by reservation 45 concurrently decoded instructions including the particular 

stations 22 and functional units 24. In other words, issue instruction, and the offset tag identifies which instruction 

position 0 is formed by reservation station 22A and func- within the set corresponds to the particular instruction. It is 

tional unit 24A. Instructions aligned and dispatched to noted that storing instruction results into register file 30 and 

reservation station 22Aare executed by functional unit 24A. freeing the corresponding storage is referred to as "retiring" 

Similarly, issue position 1 is formed by reservation station 50 the instructions. It is further noted that any reorder buffer 

22B and functional unit 24B; and issue position 2 is formed configuration may be employed in various embodiments of 

by reservation station 22C and functional unit 24C. microprocessor 10. 

Upon decode of a particular instruction, if a required As noted earlier, reservation stations 22 store instructions 

operand is a register location, register address information is until the instructions are executed by the corresponding 

routed to reorder buffer 32 and register file 30 simulta- 55 functional unit 24. An instruction is selected for execution if 

neously. Those of skill in the art will appreciate that the x86 both: (i) the operands of the instruction have been provided; 

register file includes eight 32 bit real registers (i.e., typically and (ii) the operands have not yet been provided for instruc- 

referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and tions which are within the same reservation station 

ESP). In embodiments of microprocessor 10 which employ 22A-22C and which are prior to the instruction in program 

the x86 microprocessor architecture, register file 30 com- 60 order. It is noted that when an instruction is executed by one 

prises storage locations for each of the 32 bit real registers. of the functional units 24, the result of that instruction is 

Additional storage locations may be included within register passed directly to any reservation stations 22 that are waiting 

file 30 for use by MROM unit 34. Reorder buffer 32 contains for that result at the same time the result is passed to update 

temporary storage locations for results which change the reorder buffer 32 (this technique is commonly referred to as 

contents of these registers to thereby allow out of order 65 "result forwarding"). An instruction may be selected for 

execution. A temporary storage location of reorder buffer 32 execution and passed to a functional unit 24A-24C during 

is reserved for each instruction which, upon decode, is the clock cycle that the associated result is forwarded. 
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Reservation stations 22 route the forwarded result to the sixty-four kilobytes of data in a two way set associative 

functional unit 24 in this case. structure. It is understood that data cache 28 may be imple- 

In one embodiment, each of the functional units 24 is rnented in a variety of specific memory configurations, 

configured to perform integer arithmetic operations of addi- including a set associative configuration, 

tion and subtraction, as well as shifts, rotates, logical 5 Turning now to FIG. 4, details of one embodiment of FPU 

operations, and branch operations. The operations are per- 36 are shown - °&er embodiments are possible and contem- 

formed in response to the control values decoded for a plated. FPU 36 is a high performance out-of-order execution 

particular instruction by decode units 20. Additionally, func- umt ca P able of accepting up to three new instructions per 

tional units 24 may be configured to perform address gen- clock c y cle< ^ three inslruc tions may be any combination 

eration for load and store memory operations performed by i° of x86 floating-point instructions, MMX instructions, or 

load/store unit 26. In one particular embodiment, each 3DX wstructions. MMX and 3 °X instructions are exten- 

functional unit 24 may comprise an execution unit and an s,ons t0 the standard x86 instruction set. One example of a 

independent address generation unit. Such functional units 3DX instruction set extension is the 3DNow!™ extension 

may perform an address generation for conveyance to load/ from Advanced Mlcro Devices, Inc. MMX instructions are 

store unit 26 in parallel with the execution of an integer or is g ear fd toward multimedia and two-dimensional graphic 

branch operation. applications, while 3DX instructions are optimized for per- 

cuf*uxi i . j . forming three-dimensional graphic manipulations such as 

Each of the functional units 24 also provides information A ■ J# + . \ M . . 

i- #u *• e * * # • ,y , . UAUll ^ ailuu rendering and texture mapping. Many 3 DX instructions are 

regarding the execution of conditional branch instructions to . j • • *u ♦ f .u 

*u u u j . . ■ • * j j£ • , ... vectored instructions that perform the same operation on a 

the branch prediction unit 14. If a branch prediction was K _ rtf • . M - t • f A y 

t , r . j • . . a t r . . on number of independent pairs or operands, 

incorrect, branch prediction unit 14 flushes instructions u A ,. „ ... . . , 

subsequent to the mispredicted branch that have entered the M f he fij^ illustrates, this embodiment of FPU 36 

instruction processing pipeline, and causes fetch of the com P r ^ the ^wing components: a rename-1 unit 310, 

required instructions from instruction cache 16 or main 3 rcname :? 312 > a ^uler 314 ' a retire ^™}}«> a 

memory. It is noted that in such situations, results of f e ^ ter ■ ®* 318 ' a . ioad/£ f ™ l ™ P/P eluie 32 °: a 

instructions in the original program sequence which occur 25 load ^ u My execution pipeline 322, a load/store execution 

after the mispredicted branch instruction are discarded, ™> a skld bu ^5 fi 33 4, a convert and classify unit 

including those which were speculatively executed and 33 « and a load mapper 338. Rename-1 unit 310 is coupled 

temporarily stored in load/store unit 26 and reorder buffer J^T /™ * C0D ^ ured to a specu- 

32 lative top of stack (TOS) tag and tag word 352. Rename-2 

30 unit 312 is coupled to future file tag array 328, architectural 

Floating Point Unit ta S arra y 330, ret "* e queue 316, skid buffer 334, scheduler 

314, and load mapper 338. Convert and classify unit 336 is 

Results produced by functional units 24 are sent to reorder also coupled to load mapper 338, which in turn is coupled 
buffer 32 if a register value is being updated, and to t0 execution and pipeline control unit 340 along with 
load/store unit 26 if the contents of a memory location are 35 instruction status register file 342 and scheduler 314. Reg- 
changed. If the result is to be stored in a register, reorder ^tor file 318 receives inputs from convert and classify unit 
buffer 32 stores the result in the location reserved for the 336, load mapper 338 and scheduler 314, and outputs data 
value of the register when the instruction was decoded. A t o source operand bus 348. Source operand bus 348 is in turn 
plurality of result buses 38 are included for forwarding of coupled to execution pipelines 320, 322, and 326. Finally, 
results from functional units 24 and load/store unit 26. 40 execution pipelines 320, 322, and 326, and floating-point 
Result buses 38 convey the result generated, as well as the status/control/tag words 344 are all coupled to result bus 
reorder buffer tag identifying the instruction being executed. 346. while floating-point status/control/tag words 344 and 

Load/store unit 26 provides an interface between func- speculative top of stack and tag word 352 are shown 

tional units 24 and data cache 28. In one embodiment, separately in the figure for explanatory purposes, these tags 

load/store unit 26 is configured with a pre-cache load/store 45 may be stored together with future file tags 328 and archi- 

buffer having twelve storage locations for data and address tectural regisLer tags 330 in a single storage location, e.g., 

information for pending loads or stores and a post-cache within register file 318, execution and pipeline control unit 

load/store buffer having 32 entries. Decode units 20 arbitrate 340, or retire queue 316. 

for access to the load/store unit 26. When the buffer is full, Rename-1 unit 310 receives up to three instructions per 

a decode unit must wail until load/store unit 26 has room for 50 clock cycle. As previously noted, these may be any combi- 

the pending load or store request information. Load/store nat j on G f floating-point, MMX, or 3DX instructions, 

unit 26 also performs dependency checking for load memory Rename-1 unit 310 converts stack-relative registers into 

operations against pending store memory operations to absolute register numbers. For instructions with memory 

ensure that data coherency is maintained. A memory opera- operands, e.g., FLD instructions (floating-point load), a 

tion is a transfer of data between microprocessor and the 55 stack-relative reference (e.g., the destination ST(7)) is 

main memory subsystem. Memory operations may be the mapped to an absolute register number. Furthermore, in 

result of an instruction which utilizes an operand stored in some embodiments load-execute instructions such as FADD 

memory, or may be the result of a load/store instruction [ mem ] also need to source operands converted from top-of- 

which causes the data transfer but no other operation. s t a ck relative addressing to absolute addressing. Thus, in 

Additionally, load/store unit 26 may include a special reg- 60 some embodiments x87 type instructions (i.e., floating point 

ister storage for special registers such as the segment reg- instructions) go through the stack to absolute register trans- 

isters and other registers related to the address translation lation process, while MMX and 3DNow! instructions do not. 

mechanism defined by the x86 microprocessor architecture. The x86 instruction set and architecture defines eight 

Data cache 28 is a high speed cache memory provided to floating-point registers that are accessed in a stack-like 

temporarily store data being transferred between load/store 65 manner (i.e., relative to a top-of-stack pointer). Rename-1 

unit 26 and the main memory subsystem. In one unit 310 also assigns each instruction to one of three 

embodiment, data cache 28 has a capacity of storing up to execution pipelines, either load/store execution pipeline 
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326, load/add execution pipeline 320, or load/multiply 
execution pipeline 322 and, if necessary, converts each 
instruction to an internal format. 

Rename-2 unit 312 performs true register renaming. Upon 
receiving the instructions from rename-1 unit 310, rename-2 5 
unit 312 reads three register tags from a "free list" of the 
available registers stored within retire queue 316. Once the 
registers have been read, rename-2 unit 312 assigns one to 
the destination register of each instruction. To rename the 
source registers, rename-2 unit 312 indexes tag future file 30 
328 using the absolute register number for each source 
register. Tag future file 328 stores tags that identify which 
registers store the current speculative future state of each of 
the sixteen architectural registers in FPU 36. Similarly, 
architectural register tag file 330 stores tags which identify 15 
which registers within register file 318 store the current 
architectural (non-speculative) state of FPU 36. Note, of the 
sixteen registers that define FPU 36's state (architectural or 
speculative), eight are architectural registers (i.e., floating- 
point stack or MMX registers) and eight are micro- 20 
architectural registers (i.e., registers that store internal state 
information that is not generally accessible to the 
programmer). The old destination register tags are then read 
from the tag future file 328 and written to the tail of the free 
list. Finally, tag future file 328 is updated by storing tags for 25 
the new destination registers. 

Memory source operands may be handled by assigning 
them the same register tag as the destination register. This is 
because load data will be converted and directly written into 
the destination register when it is received from load/store 30 
unit 26. In the case of an FLD instruction, no further 
processing is required (except in certain exceptional cases), 
although the FLD instruction is still assigned to an execution 
pipeline for the purpose of handling exceptions and signal- 
ing completion to reorder buffer 32. 35 

Once the three instructions have passed through rename-1 
unit 310 and rename-2 unit 312, the instructions are repre- 
sented in a three operand format (i.e., first source operand, 
second source operand, and destination operand). While the 4Q 
first source operand is always a register operand, a bit in the 
opcode may be used to indicate whether the second operand 
is a register operand or a memory operand. 

From rename-2 unit 312 the instructions are passed to 
scheduler 314, where the three instructions are allocated a 45 
"line" of storage. If scheduler 314 is full, the instructions 
may be stored in skid buffer 334 until such time as there is 
room within scheduler 314. After receiving the three 
instructions, scheduler 314 snoops result bus 346 and source 
operand bus 348. Scheduler 314 may also snoop load data 50 
bus. Concurrently with allocating the line of storage and 
snooping, retire queue 316 allocates one entry for each 
instruction. The entries store the destination register tags, the 
absolute destination register number, and the old destination 
register tags. Additional information may also be included, 55 
e.g., information that may be needed to update the archi- 
tectural state at retire time. 

On the cycle following their entry into scheduler 314, the 
instructions are available for scheduling. Scheduler 314 
examines all of the stored instructions and issues the oldest 60 
instructions which meet the following criteria: (1) the execu- 
tion pipeline to which the instruction has been assigned is 
available, (2) the result bus for that execution pipeline will 
be available on the clock cycle in which the instruction will 
complete (this is dependent upon the latency of the particu- 65 
lar instruction), and (3) the instruction's source registers and 
or memory operands are available. In this embodiment, 
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scheduler 314 may schedule up to three instructions per 
clock cycle. Each of the three execution pipelines 320, 322, 
and 326 may accept a new instruction every clock cycle. 
Note other embodiments capable of scheduling four or more 
instructions are also possible and contemplated. 

Once all three entries in a line are scheduled, that line is 
free to be compacted out of scheduler 314. When the 
instructions are conveyed from scheduler 314 to their 
respective instruction execution pipeline, their source oper- 
ands are read. In some cases, the source data will come from 
a register, while in other cases the source data will come 
from a "bypass". A bypass refers to the practice of result 
forwarding or superforwarding. Result forwarding involves 
conveying the results from a recently executed instruction 
directly to other instructions that depend upon that result. 
Result forwarding allows the result to be used in subsequent 
instructions without having to wait for the result to be stored 
in a register and having to read the result from the register. 
Result superforwarding will be described in more detail 
below. 

Each execution pipeline 320, 322, and 326 may be 
configured as a four-stage pipeline. In the first stage of each 
pipeline, the result buses are read and the input data is taken 
from either the source operand bus (if the data is within 
register file 318) or the result bus (if a bypass is being 
performed). Once the source data has been received, each 
execution pipeline may begin performing the appropriate 
computation. 

In the embodiment shown in the figure, execution pipeline 
320 is configured to perform load and addition operations, 
execution pipeline 322 is configured to perform load and 
multiplication operations, and execution pipeline 326 is 
configured to perform load and store operations. Both 
execution pipelines 320 and 322 may be configured to 
perform certain MMX instructions. Execution pipeline 322, 
which comprises multiplier 50, may also be configured to 
perform iterative calculations that involve multiplication, 
e.g., reciprocal operations, division operations, and square 
root operations, under the control of control unit 354, 
division/square root ROM 360, and, if a remainder is called 
for, remainder control unit 356. Constant ROM 358 is a read 
only memory configured to store a plurality of constants for 
load constant instructions such as FLDPI, for transcendental 
computation, for FPU 36 self-checking, and for certain 
special and exceptional results. Division/square root ROM 
360 is a read only memory which stores constants used to 
determine initial values for division and square root com- 
putations and constants returned by certain 3DNow! instruc- 
tions. Control unit 354 provides sequence information for 
division and square root functions. Note, in some embodi- 
ments control unit 354 may be part of execution and pipeline 
control unit 340. 

In some cases, floating point instruction operands or 
floating point results generated by executing an instruction 
may be too small to fit within the operand or result's 
standard data format. These numbers are referred to as 
"denormals". While normalized floating-point values have a 
non-zero exponent and a one in the most significant bit of the 
significand, i.e., the bit directly to the left of the binary radix 
point (e.g., 1.001010 . . . ), denormals are represented with 
a zero exponent and a zero in the most significant bit of the 
significand (e.g., 0.000101 . . . ). Denormal load data is 
detected and tagged by convert and classify unit 336. 
Denormal results generated by during execution within 
execution pipelines 320, 322, and 326 are tagged when they 
are generated. Execution and pipeline control unit 340 
detects the presence of the denormal tags and calls an 
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appropriate microcode routine from MROM 34 to handle the 
denormal data. 

At the end of the final execution stage, the data is placed 
on result bus 346. This makes the result data available for an 
instruction entering the first stage of one of the instruction 
execution pipelines during the next clock cycle. Once the 
data is on the result bus, it may also be written into register 
file 318. Instead of being stored in register file 318, store 
data is sent to the load/store unit 26. In some cases, however, 
store data may be sent to both the load/store unit 26 and 
register file 318 (e.g., for floating point load control word 
instructions — FLDCW — and for stores if a denormal is 
being stored). The reorder buffer tag and any exception 
information is sent back to reorder buffer 32. At this point, 
the instructions are complete. However, they are still specu- 
lative. When the instructions reach the bottom of reorder 
buffer 32 (assuming there is no branch misprediction or 
exception abort), reorder buffer 32 notifies FPU 36 that the 
instructions should be retired. The speculative state of the 
floating-point unit 36 is committed to the architectural state 
when retire queue 316 updates the tags for the architectural 
register file 328, and the destination register tags in retire 
queue 316 are written to the architectural register file 318. 

Convert and classify unit 336 receives all load data, 
classifies it by data type, and converts it to an internal format 
if necessary. In one embodiment, convert and classify unit 
336 appends a three bit classification tag to each data item. 
The three bit tag classifies the accompanying data as one of 
the following eight potential data types: (1) zero, (2) infinity, 
(3) quiet NaN, (4) signaling NaN, (5) denormal, (6) MMX, 
(7) normal, or (8) unsupported. NaN is a standard abbrevia- 
tion for "Not-a-Number^'. While representations may vary 
across different implementations, zero data types are typi- 
cally represented with a zero exponent and a zero signifi- 
cand. Similarly, infinity data types are typically represented 
with an exponent comprising all asserted ones. A quiet NaN 
("QNaN") is generated whenever a floating-point instruction 
causes an invalid operation, e.g., a square root operation on 
a negative number. A signaling NaN ("SNaN"), unlike a 
quiet NaN, generates an exception when used. Signaling 
NaNs are not generated by FPU 36 and are typically only 
used by programmers to signal particular error conditions. 
The table below illustrates the characteristics of each data 
type for x86 compatible floating-point units (wherein "x" 
represents either a one or a zero): 



Sign 


Exponent 


Significant! 


Value 


X 


00 . 


• • oo 2 


0.00 . . 


. 00 2 


Zero 


X 


11 . . 


.11 2 


1.00 . . 


. oo 2 


Infinity 


X 


11 . . 


. 11 2 


l.lxx . 


. . xx 2 


QNaN 


X 


11 . . 


.11 2 


l.Oxx . 


. .xx 2 


SNaN 


X 


00 . 


■ -00 2 


O.xx . . 


. xx 2 


Denormal 



It is noted that these data types may conform to the IEEE- 
754 specification. 

MMX data types are 64 bits wide and comprise either 
eight packed 8-bit bytes, four packed 16-bit words, or two 
packed 32-bit double-words. MMX data types may be 
detected by the MMX opcodes which precede them. Normal 
data types are standard floating-point values that are either 
single precision, double precision, or extended precision 
(before they are translated into an internal data format) and 
that do not have the characteristics of any of the previously 
described data types. Unsupported data types are extended 
precision bit patterns that do not fall into any of the 
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previously described data types and that fall outside of the 
normal data type as defined by IEEE Standard 754. For 
example, an extended precision bit sequence having a 0 sign 
bit, a biased exponent of 11 ... 11, and a significand in the 
5 format (i.ff . if) of 0.11 ... 11 is an unsupported value 
(wherein "i" is an integer bit and wherein "f * is a fractional 
bit). Note, however, in other embodiments larger or smaller 
classification tags and additional or fewer data types may be 
implemented. 

1Q The data types and exemplary formats illustrated above 
describe the data as it is received and identified by convert 
and classify unit 336. Once convert and classify unit 336 
classifies the data, the classification tag may be used to 
identify some or all of the data's properties. For example, if 
a value is classified as a zero, it may be identified solely by 

15 its classification tag instead of having to perform a wide 
comparison of each bit in the exponent and significand 
portions of the value. The classification tags may accompany 
the data throughout FPU 36 and maybe stored in register file 
18 along with the data. 

20 As discussed above, when data from a load instruction is 
received by FPU 36, the data is routed to convert and 
classify unit 336. A corresponding reorder buffer tag accom- 
panies the data and is routed to load mapper 338. As 
previously noted in the description of microprocessor 10, the 

25 reorder buffer tag identifies the sequence in which out of 
order instructions should be retired (i.e., committed to 
architectural state). For load instructions, the reorder buffer 
tag follows the data from load/store unit 26 to FPU 36. Load 
mapper 338 receives the reorder buffer tag and translates it 

30 into a physical register tag. The physical register tag indi- 
cates which data register within register file 318 the corre- 
sponding data is to be loaded into. 

Execution and pipeline control unit 340 tracks the status 
of each stage in execution pipelines 320, 322, and 326. 

35 Execution and pipeline control unit 340 contains timing 
information enabling it to determine the future availability 
of each execution pipelines. For example, when an FMUL 
(floating-point multiply) instruction begins execution in 
multiplication execution pipeline 322, control unit 340 uses 

40 its stored timing information to notify scheduler 314 that the 
result will be available for output on result bus 346 four 
clock cycles later. This timing information allows scheduler 
314 to efficiently schedule instruction for execution in 
execution pipelines 320, 322, and 326. Control unit 340 also 

45 tracks the status of each pipe stage, receiving and prioritiz- 
ing exceptions from execution pipelines 320, 322, and 326. 

FPU status word, control word, and tag word 
(collectively, words 344) are stored within retire queue 316 
and indicate which of the status and control registers within 

50 register file 318 contain the FPU's current architectural state. 
For example, in one embodiment register file 318 may 
comprise 88 registers, i.e., 16 registers to store the current 
architectural state of FPU 36 (see below), and 72 speculative 
registers to store the speculative state of FPU 36. Of the 72 

55 speculative registers, 16 store the "current" speculative state. 
Of each set of 16 registers that store a particular state of FPU 
36 (whether the state is architectural or speculative), eight 
registers are FPU stack registers and eight registers are 
micro- architectural registers that store state information that 

60 is only accessible to microcode instructions within FPU 36, 
i.e., they are not accessible to the programmer and store only 
internal state information. In one embodiment, each register 
in register file 314 is 90 bits long, with 87 bits providing 
storage for internal format data (e.g., one sign bit, 18 

65 exponent bits, and a 68-bit significand) and 3 class bits. 
Instruction status register file 342 stores information for 
execution and pipeline control unit 340. As a result of 
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instructions executing out of order in FPU 36, exceptions the result generated assuming an overflow did not occur. The 

that occur within execution pipelines may need to be stored results from carry propagate adders 386 and 390 are con- 

until the instructions generating them are the next to be veyed to below precision fix up logic 392 and 394 respec- 

retired. Retire queue 316 reads instruction status register file tiv ely. Fix up logic 392 and 394 utilize the sticky bit from 

342 when it retires instructions and updates the architectural 5 carrv propagate adder 388 to perform corrections assuming 

floating-point status word (FPSW) and floating-point control an overflow occurred and an overflow did not occur, respec- 

word (FPCW) and tag word (collectively, 344) accordingly. tivelv - These results are provided to final result selection 

This information is conveyed to rename-1 unit along with lo & ic 402 via rouadi ng logic 396. Storage register 400 may 

the current speculative top of stack 352 and on result bus be ut 1 lhzed t0 store , the ne S ated x vei ? 10 L n ( e 8" l he one ' s 

3Q complement or two s complement) of the sum of the sum 

component and carry component from adder tree 380. 

Multiplier In one embodiment, multiplier 50 may be implemented in 

FIG. 5 illustrates details of one embodiment of multiplier slatic CM0S lo S ic ™d may have a multiplier latency of four 

50. In this embodiment, multiplier 50 comprises multiplex- c l c[ ^ } t may also be configured to operate on a maximum 

ers 370 and 372, adder 374, booth encoders 376, booth 1S °* 76 ' blt operands (i.e., excluding the exponent bits), 

multiplexers 378, adder tree 380, carry-save adders 382 and Advantageously this maximum width supports exactly 

icj „™™t» «aa„~. iqk ion n^'t,; fiv i • rounded 68-bit division. Multiplier 50 may also be config- 

w ^X^f* addcre f 86 i'2^S h « 7 u urcd t0 P erform al1 3DNow!™ SIMD FP multiplication 

392 and 394, s or age registers 39<^W)0, and final result m ioL As indicated in the figure, multiplier 50 is 

selection multiplexer 402. preferably implemented in a pipelined fashion to increase 

In response to receiving an instruction, multiplexers 370 20 overall instruction throughput. One example configuration 

and 372 are configured to select multiplier and multiplicand 0 f pipelines stages is shown in FIG. 5. Pipeline stage EX1 

operands according to state machine control signals 404A is the first stage of multiplier 50. In stage EX1, overlapping 

and 404B. A state machine (not shown in the figure) controls groups of four bits of the multiplier operands are inspected 

multiplier 50 and causes multiplexers 370 and 372 to select are per the booth space three multiplier algorithm. In 

operands appropriate for the particular instruction being 25 parallel, adder 374 generates the 3X multiple of the instruc- 

executed. For example, in a standard scalar floating point tion's multiplicand. Booth encoders 376 then generate 26 

multiplication, multiplexers 370 and 372 may be configured control signals which control the selection of the 26 booth 

to select a scalar multiplier operand and a scalar multipli- multiplexers (378) to form the appropriately-signed mul- 

cand operand from memory or other source (e.g. the floating u P lcs °f tne multiplicand. This is described in greater detail 

point unit's register stack). However, when performing other 30 bd° w - 

types of operations (e.g. square root and division operations) In the second pipeline stage, EX2, the 26 partial products 

multiplexers 370 and 372 may select operands from other are reduced to two using binary adder tree 380. The indi- 

sources such as division/square root ROM 360 or from local vidual four-two adders of binary tree 380 may also be 

bypassing register 396 and 400. The more complex multi- designed in static CMOS with single rail inputs and outputs, 

plication operations that entail the use of local bypass 35 While a static dual rail compressor with dual rail inputs and 

register 396 and 400 will be described in greater detail outputs is typically faster than a single rail configuration, it 

below. The selected multiplier and multiplicand operands is also larger and may require twice the routing resources. As 

are then routed to adder 374 and booth encoders 376 discussed further below, adder tree 380 may effectively be 

respectively. The output from booth encoders 376 is con- implemented as a parallelogram. However, several folding 

veyed to booth multiplexers 378 to select the corresponding 40 ancl interleaving techniques may be applied to result in a 

partial products. rectangular adder tree (i.e., thereby leaving internal wire 

The selected partial products are then routed to adder tree lengths constant). 
380 which comprises a plurality of carry-save adders. These At the end of the second pipeline stage EX2, the first 
carry-save adders sum the partial products and output a portion of the multiplier's rounding algorithm is applied, 
two-component result comprising (i) a carry component and 45 This algorithm involves adding a rounding constant to the 
(ii) a sum component. The carry and sum components are sum and parity outputs of adder tree 380. Since the normal- 
conveyed to carry-save adders 382 and 384. Carry-save ization of the assimilated result may be unknown at this 
adders 382 and 384 are configured to sum the carry and sum point, the addition is performed twice, i.e., both assuming no 
components with rounding constants (See FIG. 6) that overflow occurs and assuming an overflow does occur, in 
assume no overflow and overflow conditions respectively. 50 carry-save adders 382 and 384 respectively. 
Thus, carry-save adder 382 sums the carry result, the sum Advantageously, carry-save adders 382 and 384 may also be 
result, and the rounding constant assuming there is no used to implement part of the "back-multiply and subtract" 
overflow. Similarly, carry-save adder 384 sums the carry operation (referred to herein as the "BACKMUL" operation) 
result, the sum result, and a rounding constant assuming which forms the remainder required for quotient and square 
there is an overflow. Advantageously, by performing the 55 root rounding. 

calculation twice in parallel the effect of an overflow con- The rounding constants used by carry-save adders 382 
dition on processing time is minimized. Note, in some and 384 are shown in FIG. 6. Columns 420 through 426 
embodiments carry-save adders 382 and 384 form the final illustrate the rounding constants used for single precision, 
stage of adder tree 380. The results from carry-save adders double precision, extended precision, and internal precision, 
382 and 384 are routed in parallel to carry propagate adders 60 respectively. Rows 430 through 436 illustrate the rounding 
386 and 390, In parallel, the results from adder tree 380 are constants used for round-to- nearest, round-to-zero, round- 
passed to carry propagate adder 388, which includes sticky to-minus infinity, and round-to-plus infinity rounding 
bit logic to calculate a sticky bit used to round the results modes, respectively. Rows 438 through 442 show the round- 
from carry propagate adder 386 and 390. The results from ing constants for divisions/square root specific operations, 
carry propagate adder 388 are provided to final result 65 LASTMUL, ITERMUL, and BACKMUL operations, 
selection logic 402, which performs the task of selecting respectively. In FIG. 6, !x implies the bit inversion of x, and 
either the result generated assuming an overflow occurred or 24'bO implies 24 zero bits. 
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In the third pipeline stage EX3, three versions of the for adder tree 380. As shown in the figure, the adder tree may 

carry-assimilated results are formed. Two of these results are be perform multiply a 76-bit multiplier operand and a 76-bit 

rounded results assuming that either overflow or no overflow multiplicand operand, or a 24-32 bit packed floating point 

occurs. The third is the raw unrounded result. In parallel, multiplier and a 24-32 bit packed floating point multipli- 

sticky bit logic within adder 388 examines the low order bits 5 cand. 

of the sum and carry vectors, FIG. 8 illustrates more details of one embodiment of the 
In the fourth pipeline stage, EX4, the bits below the target multiplier from FIG. 7. In this embodiment, selection logic 
precision are appropriately cleared. While the rounding 450 is configured to allow multiplier 50 to select and 
constant applied in clock cycle EX2 ensures that the result multiply (a) two packed floating point operands, wherein 
is correctly rounded, it does not guarantee that the bits below 10 each packed floating point operand comprises two or more 
the target least significant bit (LSB) are cleared, as may be individual floating point values; or (b) two scalar floating 
required for x87 compatibility. The LSB is then condition- point operands, wherein each scalar operand comprises a 
ally inverted for regular multiplication operations (using the single floating point value. As noted above, selection logic 
round-to-nearest rounding mode) as a function of the sticky 450 also can be configured to select the results from adder 
bit to correctly handle the nearest-even case. Finally, the 15 tree 380 as an input for iterative multiplication and square 
most significant bit (MSB) of the unrounded result deter- root operations. As shown in the figure, selection logic 450 
mines whether or not overflow has occurred, and it selects comprises a plurality of multiplexers configured to select 
between the no overflow and the with overflow rounded from among the original source operands A and B, the 
results. For division and square root iterations, an extra results from a previous iterative operation, and constant 
result R,- is also provided. R, is the one's complement of the 20 ROM 358. As previously noted, state machine 452 corn- 
regular unrounded multiply result for division (and may also prises the control logic to direct multiplier 50 to perform 
be used as an approximation to 3-N+2 for the square root standard multiplication operations, packed multiplication 
calculations). Both of the results are available for local operations, and iterative calculations such as divide and 
storage and bypassing for use within division and square square root, 
root iterations. In the case of the last cycle of a division or 25 

square root instruction, the BACKMUL instruction is per- Rounding 

formed by appropriately rounding the results chosen from n ... 4 t , A .. „ . 

the previously computed target results; the quotient («Q»), *]?.. 9 l" ustr . at « s more ,. deta , lls of °° e t^?*"*" of 

the quotient plus one ("Q + l"), or the quotient minus one ml f P' ler 50 * nd ihe pipeune stages EX2-EX4. As shown 

(«0_l»\ 30 m me fig ure » tne results from adder tree 380 are conveyed to 

, , , , , . r carry-save adder ("CSA") 382, CSA 384, and carry- 

TTie unrounded result may also be used in the case of pr0 p a g a t e adder ("CPA") 388. CSA382 receives three input 

IEEE "tiny" results with the underflow exception masked. A operandSj the sum operand from adder tree 38()) lhe 

tiny number occurs when the computed rounded result is component of the result from adder tree 380, and third input 

below the minimum extended precision normal number. In selecled by mu i tiple xer 452A. Multiplexer 452A is config- 

this case, the unrounded result may be passed to a microcode ured l0 select either rounding constant 450A(as described in 

handling routine which may properly denorraalize and round the ta51e of FIG 6)> or the original A operand from 

the result for x87 compatibility. The unrounded result may tne fioating point instruction . Multiplexer 452B is config- 

also be used to determine whether round up has occurred for urcd t0 ^ eilher rounding constant 45 ob shifted one bit 

proper setting of the "CI" condition code bit in the floating (as described in the table of FIG . 6)> or the original 

point status word (i.e., bit 8). Roundup occurs when the A operand from the floati im ^^0^ also shifted 

rounded result differs from the unrounded result, and the CI one ^ 
bit can be set appropriately given this information. For 

division and square root operations, the unrounded result is Special Instructions 
synthesized by appropriately choosing between Q-l and Q. 

45 As previously noted, multiplier 50 may also be configured 

Early Completion to execute a number of special instructions. For example, 

f , j. n . ,. _ rt , , some of these instructions are components of iterative 

In some embodiments, multiplier 50 may advantageously 11. • uj-j j , i- l . c 

, c . . ' . • 1 . r f- • calculations such as divide and square root. Embodiments of 

be configured to complete particular types of operations in 4 . 1* n u j u ^ u 1 

c 11 1 i * j j :< . « , r these, special instructions will be described below, 

fewer clock cycles than standard operations. One example of 50 r 

this is division by a number that is exactly a power of two. Itermul ( A B) 
In theory, division by exact powers of two amounts to a 

simple reduction in the exponent leaving the significand This is a multiplication operation of A and B that forces 

unchanged. Accordingly, multiplier 50 may be configured to the rounding to be round to nearest. It assumes that each of 

support variable -latency instructions. For divide-by power- 55 the input operands are 76 bits wide and that the intermediate 

of-two operations, the exponent adjustment occurs during results (which is 152 bits wide) is rounded to 76 bits 

the initial approximation lookup phase, while the rounding precision. This wider precision may be useful to correct the 

of the significand occurs using the LASTMUL and BACK- uncorrected rounding errors which accumulate throughout 

MUL operations. Microprocessor 10 and multiplier 50 may the iterations, 

be configured such that the latency of these operations are 60 

acceptable with respect to the notification time to the float- Lastmul (A, B, PC) 

ing point unit's scheduler 314. The same approach may be This is a multiplication operation of A and B that forces 

utilized for multiplication by powers of two. the round i ng to be round to nearest. It performs the rounding 

FIGS 7-8 10 a P rec * s i° D 1 bit wider than the target precision specified 

65 by PC. For division, just prior to the rounding of this 

FIG. 7 illustrates a block diagram illustrating one embodi- operation, the double with product "Q" prime may be 

ment of multiplier 50 utilizing a parallelogram configuration required to be accurate to at least: 
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where Q is the infinitely precise quotient, and pc is the target 
precision in bits, wherein 1 ulp (unit in the last place) for 
such a pc bit number is 2~ (pc ~ 1 ^. Exact rounding may require 5 
that the result have an error no worse than ±0.5 ulp, or 2 31 
PC. Multiplier 50 may be configured to utilize a rounding 
algorithm that ensures that the result is computed to an 
accuracy of at least one additional bit. This is because the 
final quotient result may be in the range of (0.5,2). Thus, 1 1Q 
bit of normalization may be used, resulting in the potential 
need for the one additional bit of accuracy. For square root 
operations, just prior to the rounding the double-width 
product S 1 may be required to be accurate to at least: 

-2~ (fiC * 1 )<S-S'<2~ <J * : + li ' ^ 

wherein S is the infinitely precise square root, and pc is the 
target precision in bits. Since the square root result is in the 
range [1,2), it may have a looser constraint on the accuracy 
of the input to this operation. 20 

After rounding and normalizing to pc +1 bits through the 
LASTMUL op, the resulting value R" may satisfy: 

25 

wherein R is either the infinitely precise quotient or square 
root as appropriate. Thus, the value may have an error of 
(-0.5,+0.5) ulp with respect to the final pc bit number. 

Backmul(B,Q,A) 

30 

This is a multiplication operation that operates on two 
source operands B and Q, and it also accepts a third operand 
A. The 152-bit intermediate product of the two sources in 
carry-save form is added with an inverted version of the 
third operand, with the low-order bits filled with l's as 35 
shown in the table in FIG. 6. These three values are then 
input into rounding carry-save adders 382 and 384, with the 
unused lsb carry bit set to one, realizing the function of 
BxQ+TwosComp(A). This implements the negative version 
of the back multiply and subtraction operation to form the 40 
remainder, that is: 

BxQ-A 

is implemented instead of the desired 

45 

A-BxQ. 

The sign of this remainder is thus the negative of the true 
remainder sign. This operation returns two bits of status: 
whether the sign of the remainder is negative, taken from a 50 
high order bit of the result, and whether the remainder is 
exactly zero, using fast sticky-bit logic. Since Q could be 
rounded to any of four precisions, the high order bit is 
chosen high enough to allow it to suffice for all of the 
precisions. 55 

Compl(x) 

This operation returns the one's complement by bit inver- 
sion of the unrounded version of the currently-computed 
product. The unbiased exponent is forced to either -1 or 0 60 
depending upon whether the result should be in the binade 
[0.5,1) or [1,2). Using the one's complement instead of the 
two's complement in the iterations may add a small amount 
of error in some embodiments. However, this error may be 
taken into account when designing the width of the multi- 65 
plier and the required accuracy of any initial approximations 
(e.g., from constant ROM 360). 



20 

Comp3(x) 

This operation returns an approximation of (3-x)/2 for the 
unrounded version of the currently-computed product. This 
approximation may be formed by bit inversion and shifting. 

Round(qi,rem,pc,rc) 

This rounding function may be configured to assume that 
a biased trial result qi has been computed with pc +1 bits, 
which is known to have an error of (-0.5,+0.5) ulp with 
respect to pc bits. The extra bit, or guard bit, may be used 
along with the sign of the remainder, a bit stating whether 
the remainder is exactly zero, and re to choose which of the 
three possible results, q, q-1, or q+1 are equal to qi truncated 
to pc bits and decremented or incremented appropriately. 
One embodiment of the actions taken for the round function 
are shown in FIG. 10. The rounding details are shown in 
FIG. 10. In the figure, RN represents round-to-nearest, RNE 
represents round-to-nearest -even, RP represents round to 
positive infinity, RM represents round to minus infinity, and 
RZ represents round-to-zero. 

For RN, in the case of an exactly halfway situation, it may 
be necessary to inspect L, the least significant bit of q to 
determine the action. If L=0, then the result is correctly 
rounded to nearest-even. Otherwise, the result is incre- 
mented to the closest even result. It should be noted that for 
division where the precision of the input operands is the 
same as that of the result, the exact halfway case will not 
occur. Only when the result precision is smaller than the 
inputs can such a result occur and rounding be required. For 
the directed rounding modes RP and RM, the action may 
depend upon the sign of the quotient estimate. Those entries 
that contain two operations such as pos/neg are for the sign 
of the final result itself being positive and negative respec- 
tively. 

This function, along with the computation of q-1 and q+1, 
may be implemented in multiplier 50 in parallel with the 
BACKMUL operation. The status information from the 
BACKMUL op may be used as input to the ROUND 
function to quickly choose and return the correctly-rounded 
result. 

Additional details regarding a number of the features of 
different embodiments of multiplier 50 (e.g., the use of the 
special instructions, division, square root operations, and 
rounding) may be found in the parent applications cited at 
the beginning of this application. 

Computer System 

Turning now to FIG. 11, a block diagram of one embodi- 
ment of a computer system 500 configured to utilize one 
embodiment microprocessor 10 with multiplier 50 is shown. 
Microprocessor 10 is coupled to a variety of system com- 
ponents through a bus bridge 502. Other embodiments are 
possible and contemplated. In the depicted system, a main 
memory 504 is coupled to bus bridge 502 through a memory 
bus 506, and a graphics controller 508 is coupled to bus 
bridge 502 through an AGP bus 510. Finally, a plurality of 
PCI devices 512A-512B are coupled to bus bridge 502 
through a PCI bus 514. A secondary bus bridge 516 may 
further be provided to accommodate an electrical interface 
to one or more EISA or ISA devices 518 through an 
EISA/ISA bus 520. Microprocessor 10 is coupled to bus 
bridge 502 through a CPU bus 524. 

Bus bridge 502 provides an interface between micropro- 
cessor 10, main memory 504, graphics controller 508, and 
devices attached to PCI bus 514. When an operation is 
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received from one of the devices connected to bus bridge 
502, bus bridge 502 identifies the target of the operation (e.g. 
a particular device or, in the case of PCI bus 514, that the 
target is on PCI bus 514). Bus bridge 502 routes the 
operation to the targeted device. Bus bridge 502 generally 5 
translates an operation from the protocol used by the source 
device or bus to the protocol used by the target device or bus. 

In addition to providing an interface to an ISA/EISA bus 
for PCI bus 514, secondary bus bridge 516 may further 
incorporate additional functionality, as desired. For 3Q 
example, in one embodiment, secondary bus bridge 516 
includes a master PCI arbiter (not shown) for arbitrating 
ownership of PCI bus 514. An input/output controller (not 
shown), either external from or integrated with secondary 
bus bridge 516, may also be included within computer 15 
system 500 to provide operational support for a keyboard 
and mouse 522 and for various serial and parallel ports, as 
desired. An external cache unit (not shown) may further be 
coupled to CPU bus 524 between microprocessor 10 and bus 
bridge 502 in other embodiments. Alternatively, the external 2Q 
cache may be coupled to bus bridge 502 and cache control 
logic for the external cache may be integrated into bus 
bridge 502. 

Main memory 504 is a memory in which application 
programs are stored and from which microprocessor 10 25 
primarily executes. A suitable main memory 504 comprises 
DRAM (Dynamic Random Access Memory), and preferably 
a plurality of banks of SDRAM (Synchronous DRAM). 

PCI devices 512A-512B are illustrative of a variety of 
peripheral devices such as, for example, network interface 30 
cards, video accelerators, audtio cards, hard or floppy disk 
drives or drive controllers, SCSI (Small Computer Systems 
Interface) adapters and telephony cards. Similarly, ISA 
device 518 is illustrative of various types of peripheral 
devices, such as a modem, a sound card, and a variety of data 35 
acquisition cards such as GPIB or field bus interface cards. 

Graphics controller 508 is provided to control the render- 
ing of text and images on a display 526. Graphics controller 
508 may embody a typical graphics accelerator generally 
known in the art to render three-dimensional data structures 40 
which can be effectively shifted into and from main memory 
504. Graphics controller 508 may therefore be a master of 
AGP bus 510 in that it can request and receive access to a 
target interface within bus bridge 502 to thereby obtain 
access to main memory 504. A dedicated graphics bus 45 
accommodates rapid retrieval of data from main memory 
504. For certain operations, graphics controller 508 may 
further be configured to generate PCI protocol transactions 
on AGP bus 510. The AGP interface of bus bridge 502 may 
thus include functionality to support both AGP protocol 50 
transactions as well as PCI protocol target and initiator 
transactions. Display 526 is any electronic display upon 
which an image or text can be presented. A suitable display 
526 includes a cathode ray tube ("CRT"), a liquid crystal 
display ("LCD"), etc. 55 

It is noted that, while the AGP, PCI, and ISA or EISA 
buses have been used as examples in the above description, 
any bus architectures may be substituted as desired. It is 
further noted that computer system 500 may be a multipro- 
cessing computer system including additional microproces- 60 
sors (e.g. microprocessor 10a shown as an optional compo- 
nent of computer system 500). Microprocessor 10a may be 
similar to microprocessor 10. More particularly, micropro- 
cessor 10a may be an identical copy of microprocessor 10. 
Microprocessor 10a may share CPU bus 524 with micro- 65 
processor 10 or may be connected to bus bridge 502 via an 
independent bus. 



It is still further noted that the present discussion may 
refer to the assertion of various signals. As used herein, a 
signal is "asserted" if it conveys a value indicative of a 
particular condition. Conversely, a signal is "deasserted" if 
it conveys a value indicative of a lack of a particular 
condition. A signal may be defined to be asserted when it 
conveys a logical zero value or, conversely, when it conveys 
a logical one value. Additionally, various values have been 
described as being discarded in the above discussion. A 
value may be discarded in a number of manners, but 
generally involves modifying the value such that it is 
ignored by logic circuitry which receives the value. For 
example, if the value comprises a bit, the logic state of the 
value may be inverted to discard the value. If the value is an 
n-bit value, one of the n-bit encodings may indicate that the 
value is invalid. Setting the value to the invalid encoding 
causes the value to be discarded. Additionally, an n-bit value 
may include a valid bit indicative, when set, that the n-bit 
value is valid. Resetting the valid bit may comprise discard- 
ing the value. Other methods of discarding a value may be 
used as well. 

Although the embodiments above have been described in 
considerable detail, other versions are possible. Numerous 
variations and modifications will become apparent to those 
skilled in the art once the above disclosure is fully appre- 
ciated. It is intended that the following claims be interpreted 
to embrace all such variations and modifications. 

What is claimed is: 

1. A method for performing integer and floating point 
multiplication, the method comprising: 

detecting a type of multiplication operation to be per- 
formed; 

asserting one or more control signals in response to 
detecting the type of multiplication operation to be 
performed; 

conveying the asserted control signals to a multiplier and 
selection logic configurable to perform the multiplica- 
tion operation; 

selecting a multiplier operand and a multiplicand operand 
corresponding to the multiplication operation to be 
performed in response to the asserted control signals; 

multiplying the multiplier operand and the multiplicand 
operand, wherein the multiplication operation is 
selected from the group comprising: scalar floating 
point multiplication operations, iterative multiplication 
operations performed in the context of a divide 
operation, and iterative multiplication operations per- 
formed in the context of a square root operation; 

detecting floating point divide operations that have divi- 
sors that are exactly a power of two in response to 
detecting an iterative multiplication operation associ- 
ated with a divide operation; and 

executing the divide by power of two operation by: 
shifting dividends; and 

rounding the result to the appropriate precision using a 
specified rounding mode. 

2. A method for performing integer and floating point 
multiplication, the method comprising: 

delecting a type of multiplication operation to be per- 
formed; 

asserting one or more control signals in response to 
detecting the type of multiplication operation to be 
performed; 

conveying the asserted control signals to a multiplier and 
selection logic configurable to perform the multiplica- 
tion operation; 
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selecting a multiplier operand and a multiplicand operand 
corresponding to the multiplication operation to be 
performed in response to the asserted control signals; 

multiplying the multiplier operand and the multiplicand 
operand, wherein the multiplication operation is 5 
selected from the group comprising: scalar floating 
point multiplication operations, iterative multiplication 
operations performed in the context of a divide 
operation, and iterative multiplication operations per- 
formed in the context of a square root operation; 

wherein multiplying includes: 

generating a plurality of partial products; 
summing the partial products using an adder tree to 
generate a sum component and a carry component; 
summing the sum component and the carry component 1S 
assuming there is no overflow to generate a first 
result; 

summing the sum component and the carry component 
assuming there is an overflow to generate a second 
result; and 

selecting either the first result or the second result based 
on a most significant bit of either the first result or the 
second result. 

3. The method as recited in claim 2, wherein said multi- 
plying further comprises: 25 

selectively adding a rounding constant to the first result 
and the second result. 

4. A floating point unit comprising: 

a multiplier configured to perform scalar floating point 

multiplication of the form AxB, 30 
wherein the multiplier includes: 

a binary adder tree configured to generate a result 
having a sum component and a carry component; and 

an overflow adder coupled to the binary adder tree and 
configured to sum the sum component and the carry 35 
component from the adder tree with a first rounding 
constant assuming the resulting sum will be normal- 
ized and have an overflow; 

a non-overflow adder coupled to the binary adder tree 
and configured to sum the sum component and the 40 
carry component from the adder tree with a second 
rounding constant assuming the resulting sum will 
be normalized and not have an overflow; and 

a non-rounding adder coupled to the binary adder tree 
and configured to sum the sum component and the 45 
carry component from the adder tree assuming the 
result is not be rounded. 

5. The floating point unit as recited in claim 4, wherein the 
binary adder tree comprises a plurality of cascaded carry- 
save adders, and wherein each of the overflow adder, the 50 
non-overflow adder, and the non-rounding adder is a carry- 
propagate adder. 

6. The floating point unit as recited in claim 4, wherein the 
multiplier is further configured to perform scalar floating 
point operations of the form Xx Y-Z. wherein X, Y and Z are 55 
floating point operands, and wherein the product XxY is 
either (i) in the same binate as Z, or (ii) one bin ate greater 
than Z, wherein the multiplier further comprises one or more 
inverters configured to generate the one's complement of Z, 6Q 
wherein the overflow adder and the non-overflow adder are 
configured to receive and sum (a) the one's complement of 

Z, (b) the product XxY, and (c) a constant one. 

7. The floating point unit as recited in claim 6, wherein X 
and Y are equal for square root instructions. 65 

8. The floating point unit as recited in claim 6, wherein X 
and Y need not be equal for divide instructions. 



9. The floating point unit as recited in claim 4, further 
comprising selection logic configured to select the results 
from the non-rounding adder if the results from either the 
overflow adder or the non-overflow adder is determined to 
be below a predetermined minimum value. 

10. The floating point unit as recited in claim 9, wherein 
the multiplier is configured to call a microcode routine upon 
selecting the results from the non-rounding adder in 
response to a masked bit being set, wherein the microcode 
routine is configured to de normalize the results from the 
non-rounding adder, round the denormalized result, and then 
normalize the rounded denormalized result. 

11. The floating point unit as recited in claim 4 further 
comprising selection logic configured to select two source 
operands for the multiplier to multiply. 

12. The floating point unit as recited in claim 11, wherein 
the source operands are selected from at least the following: 

(a) original source operands including: two packed float- 
ing point values, wherein each packed floating point 
value comprises two or more individual floating point 
values, and two scalar floating point values; and 

(b) results stored in a result register. 

13. The floating point unit as recited in claim 12, wherein 
the multiplier further comprises a state machine coupled to 
the selection logic, wherein the state machine is configured 
to cause the selection logic to select the stored results in 
response to the multiplier executing a division instruction by 
performing iterative multiplication operations. 

14. The floating point unit as recited in claim 12, wherein 
the multiplier further comprises a state machine coupled to 
the selection logic, wherein the state machine is configured 
to cause the selection logic to select the stored results in 
response to the multiplier executing a square root instruction 
by performing iterative multiplication operations. 

15. The floating point unit as recited in claim 12, wherein 
the selection logic comprises a first multiplexer unit config- 
ured to select a multiplier operand for the multiplier, and a 
second multiplexer unit configured to select a multiplicand 
operand for the multiplier, wherein the first multiplexer unit 
is configured to select from at least the original source 
operands and the stored results. 

16. A computer system comprising: 
a processor; 

a main system memory coupled to the processor via a bus; 
wherein the processor includes a floating point unit 
including: 

a multiplier configured to perform floating point mul- 
tiplication of the form AxB; 

a register configured to store results generated by the 
multiplier; and 

selection logic configured to select source operands for 
the multiplier to multiply, 

wherein the multiplier includes: 

a binary adder tree configured to generate a result 
having a sum component and a carry component; 
and 

an overflow adder coupled to the binary adder tree 
and configured to sum the sum component and the 
carry component from the adder tree with a first 
rounding constant assuming the resulting sum will 
be normalized and have an overflow; 

a non-overflow adder coupled to the binary adder 
tree and configured to sum the sum component 
and the carry component from the adder tree with 
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a second rounding constant assuming the resulting 
sum will be normalized and not have an overflow; 
and 

a non-rounding adder coupled to the binary adder 
tree and configured to sum the sum component 
and the carry component from the adder tree 
assuming the result is not be rounded. 

17. The computer system as recited in claim 16, wherein 
the source operands are selected from at least the following: 

(a) original source operands including: two packed float- 
ing point values, wherein each packed floating point 
value comprises two or more individual floating point 
values, and two scalar floating point values; and 

(b) the results stored in the register. 

18. The computer system as recited in claim 17, wherein 
the multiplier further comprises a state machine coupled to 
the selection logic, wherein the state machine is configured 
to cause the selection logic to select the stored results in 
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response to the multiplier executing a division instruction by 
performing iterative multiplication operations. 

19, The computer system as recited in claim 17, wherein 
the multiplier further comprises a state machine coupled to 

5 the selection logic, wherein the state machine is configured 
to cause the selection logic to select the stored results in 
response to the multiplier executing a square root instruction 
by performing iterative multiplication operations. 

10 20. The computer system as recited in claim 16, wherein 
the selection logic comprises a first multiplexer unit config- 
ured to select a multiplier operand for the multiplier, and a 
second multiplexer unit configured to select a multiplicand 
operand for the multiplier, wherein the first multiplexer unit 

15 is configured to select from at least the original source 
operands and the stored results. 

* * * * * 
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ABSTRACT 



There is disclosed, for use in an x86-compatiblc processor, 
an interface circuit for synchronizing the transfer of signals 
between different clock domains derived from a common 
core clock, where the phase and frequency relationships 
between the different domain clocks are known. The inter- 
face circuit comprises 1) a first latch having a data input for 
receiving a data signal from the first clock domain, a clock 
input for receiving the first clock signal, and an output; 2) a 
second latch having a data input coupled to the first latch 
output, an enable input for receiving a gating signal, a clock 
input for receiving the first clock signal, and an output; 3) a 
third latch having a data input for receiving the data signal, 
an enable input for receiving a gating signal, a clock input 
for receiving the first clock signal, and an output; and 4) a 
multiplexer having a first data input coupled to the second 
latch output, a second data input coupled to the third latch 
output, and a selector input for selecting one of the first data 
input and the second data input for transfer to an output of 
the multiplexer. 

20 Claims, 6 Drawing Sheets 
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LOW-LATENCY CIRCUIT FOR 
SYNCHRONIZING DATA TRANSFERS 
BETWEEN CLOCK DOMAINS DERIVED 
FROM A COMMON CLOCK 

5 

CROSS-REFERENCE TO RELATED 
APPLICATION 

The present invention is related to that disclosed in U.S. 
patent application Ser. No. 09/477,488, filed concurrently 
herewith, entitled ALOW LATENCY CLOCK DOMAIN 
SYNCHRONIZATION CIRCUIT AND METHOD OF 
OPERATION. The above application is commonly assigned 
to the assignee of the present invention. The disclosure of the 
related patent application is hereby incorporated by refer- 15 
ence for all purposes as if fully set forth herein. 

TECHNICAL FIELD OF THE INVENTION 

The present invention is directed, in general, to micro- 
processors and, more specifically, to synchronization cir- 20 
cuits for transferring data between two different clock 
domains controlled by a processing device. 

BACKGROUND OF THE INVENTION 

The ever-growing requirement for high performance com- 25 
puters demands that state-of-the-art microprocessors 
execute instructions in the minimum amount of time. Over 
the years, efforts to increase microprocessor speeds have 
followed different approaches, including increasing the 
speed of the clock that drives the processor and reducing the 30 
number of clock cycles required to perform a given instruc- 
tion. 

Microprocessor speeds may also be increased by reducing 
the number of gate delays incurred while executing an 35 
operation. Under this approach, the microprocessor is 
designed so that each data bit or control signal propagates 
through the smallest possible number of gates when per- 
forming an operation. Additionally, the propagation delay 
through each individual gate is also minimized in order to 40 
further reduce the end-to-end propagation delay associated 
with transmitting a control signal or a data bit during the 
execution of an instruction. 

One area where it is important to minimize propagation 
delays occurs at the interface between clock domains. Con- 45 
ventional microprocessors contain many clock signals that 
are derived from a basic high-frequency core clock. The core 
clock signal may be divided down to produce clock signals 
that are related, for example, by an N:l ratio or by an 
(N+2):l ratio. For instance, dividing the core clock by two 50 
and dividing the core clock by four yields two clock signals 
that are in a 2:1 ratio. Similarly, dividing the core clock by 
two and dividing the core clock by five yields two clock 
signals that are in a 2.5:1 ratio. These different clock domain 
signals may drive internal microprocessor components or 55 
may be brought off-chip to drive external devices, such as 
main memory, input/output (I/O) buses, and the like. 

At the interface between two clock domains, there is no 
guarantee that a signal transmitted from a first clock domain 
will be synchronized with the clock in a second clock 60 
domain. Normally, synchronization between different clock 
domains is handled by a set of synchronizing flip-flops. A 
signal in a first clock domain is first registered in a flip-flop 
in the first clock domain. The output of that first flip-flop is 
then Adouble sampled® by two flip-flop in the second clock 65 
domain. Double sampling means that the output of the first 
flip-flop feeds the input of a second flip-flop clocked in the 
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second clock domain. The output of the second flip-flop 
feeds the input of a third flip-flop that also is clocked in the 
second clock domain. The output of this third flip-flop is 
properly synchronized with the second clock domain. An 
identical three flip-flop interface circuit is used to synchro- 
nize signals that are being transmitted in the reverse direc- 
tion (i.e., from the second clock domain to the first clock 
domain). This synchronizing circuit, along with grey code 
encoding of multi-bit signals provides a means for synchro- 
nizing two asynchronous clock domains. 

The chief drawback of the above-described flip-flop inter- 
face circuit is the fact that there are three gate propagation 
delays involved in transmitting a signal from one clock 
domain . to another clock domain. This necessarily slows 
down the operation of the microprocessor and/or an external 
device communicating wit the microprocessor, since the 
circuits in the receiving domain receive the transmitted 
signal only after at least three propagation delays. 

Therefore, there is a need in the art for improved micro- 
processor designs that maximize the throughput of a pro- 
cessor and any external devices communicating with the 
processor. In particular, there is a need in the art for 
improved circuits that interface signals between different 
clock domains. More particularly, there is a need for inter- 
face circuits that minimize the number of gate delays that 
affect a signal being transmitted from a faster clock domain 
to a slower clock domain, and vice versa. 

SUMMARY OF THE INVENTION 

The limitations inherent in the prior art described above 
are overcome by the present invention, which provides an 
interface circuit for synchronizing the transfer of data 
through an output port from a first clock domain driven by 
a first clock signal to a second clock domain driven by a 
second clock signal. In an advantageous embodiment, the 
interface circuit comprises 1) a first latch having a data input 
for receiving a data signal from the first clock domain, and 
enable input for receiving an enabling signal, a clock input 
for receiving the first clock signal, and an output; 2) a second 
latch having a data input coupled to the first latch output, a 
clock input for receiving a gating signal, a clock input for 
receiving the first clock signal, and an output; 3) a third latch 
having a data input for receiving the data signal, and enable 
input for receiving a phase sel3ect signal, a clock input for 
receiving the first clock signal, and an output; and 4) a 
multiplexer having a first data input coupled to the second 
latch output, a second data input coupled to the third latch 
output, and a selector input for selecting one of the first data 
input and the second data input for transfer to an output of 
the multiplexer. 

According to one embodiment of the present invention, 
the second clock signal and the first clock signal are derived 
from a common core clock. 

According to another embodiment of the present 
invention, a frequency of the second clock signal and a 
frequency of the first clock signal are in a ratio of N:l where 
N is an integer. 

According to still another embodiment of the present 
invention, a selection signal applied to the selector input 
selects the first data input of the multiplexer when a rising 
edge of the first clock signal is approximately in phase with 
a rising edge of the second clock signal. 

According to yet another embodiment of the present 
invention, a frequency of the second clock signal and a 
frequency of the first clock signal are in a ratio of (N+2):l 
where N is an integer. 
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According to a further embodiment of the present 
invention, a selection signal applied to the selector input 
selects the first data input of the multiplexer during one 
clock cycle of the second clock signal. 

The present invention may also be embodied as an 
interface circuit for synchronizing the transfer of data from 
an output of a state machine in a first clock domain driven 
by a first clock signal to a second clock domain driven by a 
second clock signal. In an advantageous embodiment, the 
state machine interface circuit comprises 1) a first latch 
having a data input for receiving the state machine output, 
a clock input for receiving the first clock signal, and an 
output; and 2) a second latch having a data input coupled to 
the first latch output, a clock input for receiving a gating 
signal, and an output coupled to an input of the state 
machine. 

According to one state machine interface embodiment of 
the present invention, the second clock signal and the first 
clock signal are derived from a common core clock. 

According to another state machine interface embodiment 
of the present invention, a frequency of the second clock 
signal and a frequency of the first clock signal are in a ratio 
of N:l where N is an integer. 

According to still another state machine interface embodi- 
ment of the present invention, a frequency of the second 
clock signal and a frequency of the first clock signal are in 
a ratio of (N+2): 1 where N is an integer. 

The foregoing has outlined rather broadly the features and 
technical advantages of the present invention so that those 
skilled in the art may better understand the detailed descrip- 
tion of the invention that follows. Additional features and 
advantages of the invention will be described hereinafter 
that form the subject of the claims of the invention. Those 
skilled in the art should appreciate that they may readily use 
the conception and the specific embodiment disclosed as a 
basis for modifying or designing other structures for carry- 
ing out the same purposes of the present invention. Those 
skilled in the art should also realize that such equivalent 
constructions do not depart from the spirit and scope of the 
invention in its broadest form. 

Before undertaking the DETAILED DESCRIPTION, it 
may be advantageous to set forth definitions of certain words 
and phrases used throughout this patent document: the terms 
Ainclude@ and Acomprise,@ as well as derivatives thereof, 
mean inclusion without limitation; the term Aor,@ is 
inclusive, meaning and/or; the phrases Aassociated with® 
and Aassociated therewith,® as well as derivatives thereof, 
may mean to include, be included within, interconnect with, 
contain, be contained within, connect to or with, couple to 
or with, be communicable with, cooperate with, interleave, 
juxtapose, be proximate to, be bound to or with, have, have 
a property of, or the like; and the term Acontroller® means 
any device, system or part thereof that controls at least one 
operation, such a device may be implemented in hardware, 
firmware or software, or some combination of at least two of 
the same. It should be noted that the functionality associated 
with any particular controller may be centralized or 
distributed, whether locally or remotely. Definitions for 
certain words and phrases are provided throughout this 
patent document, those of ordinary skill in the art should 
understand that in many, if not most instances, such defini- 
tions apply to prior, as well as future uses of such defined 
words and phrases. 

BRIEF DESCRIPTION OF THE DRAWINGS 

For a more complete understanding of the present 
invention, reference is now made to the following descrip- 
tions taken in conjunction with the accompanying drawings, 
in which: 
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FIG. 1 is a block diagram of an exemplary integrated 
processor system, including an integrated microprocessor in 
accordance with the principles of the present invention; 

FIG. 2 illustrates in more detail the exemplary integrated 
5 microprocessor in FIG. 1 in accordance with one embodi- 
ment of the present invention; 

FIG. 3 is a schematic diagram of a synchronization circuit 
for synchronizing the output of a state machine to a clock 
domain; 

10 FIG. 4 is a schematic diagram of a synchronization circuit 
for synchronizing the transfer of data between two asyn- 
chronous clock domains; 

FIG. 5 is a timing diagram illustrating the operations of 
the synchronization circuits illustrated in FIGS. 3 and 4 in 

15 accordance with an exemplary embodiment of the present 
invention; and 

FIG. 6 is a timing diagram illustrating the operations of 
the synchronization circuits illustrated in FIGS. 3 and 4 in 
accordance with an exemplary embodiment of the present 
20 invention. 

DETAILED DESCRIPTION 

FIGS. 1 through 6, discussed below, and the various 
embodiments used to describe the principles of the present 

25 invention in this patent document are by way of illustration 
only and should not be construed in any way to limit the 
scope of the invention. Those skilled in the art will under- 
stand that the principles of the present invention may be 
implemented in any suitably arranged integrated micropro- 

30 cessor. 

Integrated Processor System 

FIG. 1 is a block diagram of an exemplary integrated 
processor system, including integrated processor 100 in 
accordance with the principles of the present invention. 

35 Integrated microprocessor 100 includes central processing 
unit (CPU) 110, which has dual integer and dual floating 
point execution units, separate load/store and branch units, 
and LI instruction and data caches. Integrated onto the 
microprocessor die is graphics unit 120, system memory 

40 controller 130, and L2 cache 140, which is shared by CPU 
110 and graphics unit 120. Bus interface unit 150 interfaces 
CPU 110, graphics unit 120, and L2 cache 140 to memory 
controller 130. 

Integrated memory controller 130 bridges processor 100 

45 to system memory 160, and may provide data compression 
and/or decompression to reduce bus traffic over external 
memory bus 165 which preferably, although not exclusively, 
has a RAMbusJ, fast SDRAM or other type protocol. 
Integrated graphics unit 120 provides TFT, DSTN, RGB, 

50 and other types of video output to drive display 180. 

Bus interface unit 150 interfaces, through I/O interface 
152, processor 100 to chipset bridge 190 for conventional 
peripheral bus 192 connection (e.g., PCI connection) to 
peripherals, such as sound card 194, LAN controller 195, 

55 and disk drive 196, as well as fast serial link 198 (e.g., IEEE 
1394 "fire wire" bus and/or universal serial bus "USB") and 
relatively slow I/O port 199 for peripherals, such as a 
keyboard and/or a mouse. Alternatively, chipset bridge 160 
may integrate local bus functions such as sound, disk drive 

60 control, modem, network adapter, etc. 
Integrated CPU 

FIG. 2 illustrates in more detail the exemplary integrated 
processor 100, including CPU 110, which is integrated with 
graphics controller 120, memory controller 130, and L2 

65 unified cache 140 (e.g., 256 KB in size). CPU 110 includes 
an execution pipeline with instruction decode/dispatch logic 
200 and functional units 250. 
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Instruction decode/dispatch logic 200 decodes variable 
length x86 instructions into nodes (operations) each con- 
taining source, destination, and control logic. Each instruc- 
tion maps into one or more nodes, which are formed into 
checkpoints for issue in parallel to functional units 250. The 5 
exemplary execution pipeline includes dual integer units 
(EX) 255, dual pipelined floating point units (FP) 260, 
load/store unit (LDST) 265, and branch unit (BR) 270. 
Hence, a single checkpoint can include up to 2 EX, 2 FP, 1 
LDST, and 1 BR nodes which can be issued in parallel, LI 10 
data cache (DC) 280 (e.g., 16 KB in size) receives data 
requests from the LDST unit and, in the case of an LI hit, 
supplies the requested data to appropriate EX or FP unit. 

BR unit 270 executes branch operations based on flag 
results from the EX units. Predicted (taken/not-taken) and is 
not-predicted (undetected) branches are resolved (mis- 
predictions incur, for example, a 12 clock penalty) and 
branch information is supplied to BTB 275, including 
branch address, target address, and resolution (taken or not 
taken). BTB 275 includes a 1 KB target cache, a 7-bit history 20 
and prediction ROM, and a 16-entry return stack. 

Instruction decode/dispatch logic 200 includes LI instruc- 
tion cache (IC) 210 (e.g., 16 KB in size) which stores 
32-byte cache lines (8 dwords/4 q words). Each fetch 
operation, fetch unit 215 fetches a cache line of 32 instruc- 25 
tion bytes from the LI instruction cache to aligner logic 220. 
Fetch unit 215 either (a) generates a fetch address by 
incrementing the previous fetch address (sequential fetch) 
or, (b) if the previous fetch address hit in BTB 275, switches 
the code stream by supplying the fetch address for the cache 30 
line containing the target address provided by BTB 275. 
Fetch unit 215 supplies a linear address simultaneously to 
LI instruction cache 210 and BTB 275. A two-level trans- 
lation look- aside buffer (TLB) structure (a 32-entry LI 
instruction TLB and a 256-entry shared L2 TLB) supplies a 35 
corresponding physical address to the LI cache to complete 
cache access. 

Aligner logic 220 identifies up to two x86 variable length 
instructions per clock. Instructions are buffered in instruc- 
tion buffer 225, along with decode and issue constraints. 40 
Decoder 230 transfers instructions from the instruction 
buffer to the appropriate one (as determined by decode 
constraints stored with the instruction) of decoders DO, Dl, 
and Useq (a microsequencer). DO and Dl define two decode 
slots (or paths) SO and SI, with the Useq decoder feeding 45 
nodes into both slots simultaneously. 

DO and Dl each decode single node EX/FPU/BR instruc- 
tions that do not involve memory references (e.g., register- 
register integer and floating point operations and branch 
operations), while memory reference instructions, which 50 
decode into separate EX/FP and LDST nodes (e.g., register- 
memory integer and floating point operations), are con- 
strained to DO. The Useq decoder handles instructions that 
decode into more than two nodes/operations (e.g., far calls/ 
returns, irets, segment register loads, floating point divides, 55 
floating point transcendentals). Each such sequence of nodes 
are organized into one or more separate checkpoints issued 
in order to the functional units. Renaming logic 235 
(including a logical-to-physical map table) renames sources 
and destinations for each node, mapping logical to physical 60 
registers. 

Issue logic 240 organizes the renamed nodes from each 
slot into checkpoints that are scheduled for issue in order to 
the functional units. Most instructions can be dual issued 
with the nodes for each in the same checkpoint. Up to 16 65 
checkpoints may be active (i.e., issued to functional units). 
Nodes are issued into reservation stations in each functional 



unit. Once in the reservation stations, the nodes complete 
execution out-of-order. 

The dual EX0/EX1 (integer) units 255 are pipelined with 
separate copies of a physical register file, and execute and 
forward results in a single cycle. The dual FPU0/FPU1 units 
260 include dual execution units (with separate FP physical 
register files) that support MMX and 3DNow instructions, as 
well as standard x87 floating point, instruction execution. 
FPU0 includes a pipelined FAdder and FPU1 includes a 
pipelined Fmultipler, both supporting packed SIMD opera- 
tions. 

Integer multiply operations are issued to FPU1 with the 
Fmultiplier, and integer divide operations are issued as 
separate nodes to both FPU0 and FPU1, so that integer EX 
operations can execute in parallel with integer multiplies and 
divides. Results are forwarded between EX0/EX1 and 
FPU0/FPU1 in a single cycle. 

LDST unit 265 executes memory reference operations as 
loads/stores to/from data cache 280 (or L2 cache 140). 
LDST unit 265 performs pipelined linear address calculation 
and physical (paged) address translation, followed by data 
cache access with the physical (translated) address. Address 
translations are performed in order using a two-level TLB 
structure (a 32 entry LI data TLB and the 256 entry shared 
L2 TLB). Up to four pending LI misses can be outstanding. 
Missed data returns out of order (from either L2 cache 140 
or system memory 160). 

Exemplary 16 KB LI instruction cache 210 is single- 
ported 4-way associative, with 2 pending misses. Exemplary 
16 KB LI data cache 280 is non-blocking, dual-ported (one 
load port and one store/fill port), 4-way associative, with 4 
pending misses. Both LI caches are indexed with the linear 
address and physically tagged with the TLB (translated) 
address. In response to LI misses, L2 cache 140 transfers an 
entire cache line (32 bytes/256 bits) in one cycle with a 7 
clock access latency for LI misses that hit in L2 cache 140. 

Exemplary 256 KB L2 cache 140 is 8-way associative and 
8-way interleaved. Each interleave supports one LI (code/ 
data) miss per cycle, and either one LI store or one L2 fill 
per cycle. Portions or all of 2 of the 8 ways may be locked 
down for use by graphics controller 120. 

For integer register-to-register operations, the execution 
pipeline is eleven (11) stages from code fetch to completion: 
two cache access stages (IC1 and IC2), two alignment stages 
(AL1 and AL2), three decode/rename stages 
(DEC0-DEC2), checkpoint issue stage (ISS), and reserva- 
tion stage (RS), followed by the execute and result write- 
back/forward stages (EX and WB). For integer register- 
memory operations, the LDST unit pipeline adds an 
additional four stages between RS and EX: address calcu- 
lation (AC), translation (XL), and data cache access and 
drive back DC and DB. The floating point adder pipeline 
comprises four stages and the floating point multiply pipe- 
line comprises five stages. 

Different functional blocks in integrated processor 100 
may operate at different clock speeds. Each group of circuits 
that are driven at a specified clock speed is referred to as a 
clock domain. As described above in the Background, 
special synchronization circuitry is needed to transfer data 
from one clock domain to another clock domain. However, 
because all of the clock domains in integrated processor 100 
are derived from a common core clock, the phase and 
frequency relationships between the different clock domains 
are known. The present invention use knowledge of the 
phase and frequency relationships between clock domains to 
provide unique synchronization circuits that minimize the 
number of gates and clock delays encountered when trans- 
ferring data from one domain to another domain. 
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FIG. 3 is a schematic diagram of exemplary synchroni- 
zation circuit 300 for synchronizing the output of a state 
machine to a clock domain. Exemplary synchronization 
circuit 300 comprises latch 302, latch 304, inverter 306, 
inverter 307, AND gate 308, and state machine logic circuit 5 
310. The data input (D) of latch 302 is connected to the 
Anext state® output (NEXT) of state machine logic circuit 
310, and the enable input (EN) of latch 302 is permanently 
connected to a Logic 1 enabling signal. Latch 302 transfers 
NEXT to its Q output on the rising edge of CLK. 10 

The output of latch 302 is connected to the data (D) input 
of latch 304. Inverters 306 and 307 invert the CLK signal. 
The inverted CLK signal is one input to AND gate 308. The 
other input of AND gate 308 receives the PHASE signal. 
The output of AND gate 308 is a gated clock signal that is 15 
Ahigh@ (or Logic 1) when inverted CLK and PHASE are 
both high. The output of AND gate 308 is connected to the 
enable (EN) input of latch 304. Latch 304 transfers the 
clocked output of latch 302 to the Q output of latch 304 on 
the rising edge of the inverted CLK signal from inverter 307, 20 
providing an output which is synchronized with clock 
domain of the CLK domain. The Q output of latch 304 
represents the current state (CURRENT) which is connected 
as the input to Logic circuit 310. Since the CURRENT input 
to state machine logic circuit 310 is synchronized with the 25 
CLK signal, the NEXT output of state machine logic circuit 
310 is also synchronized with the CLK signal. 

FIG. 4 is a schematic diagram of exemplary synchroni- 
zation circuit 400 for synchronizing the transfer of data 
between two asynchronous clock domains. Synchronization 30 
circuit 400 transfers the DATA signal off-chip to another 
circuit connected to pin 430. Latches 402, 404, and 410 and 
multiplexer 412 form synchronizing circuit for an input data 
signal, labeled ADATA@ in FIG. 4. Latches 420, 422, and 
424 and multiplexer 426 form a synchronizing circuit for an 35 
input data enable signal, labeled AD ATA ENABLE® in 
FIG. 4, Inverter 406 and AND gate 408 provide a gated 
inverted clock signal for use by both synchronizing circuit 
groups. Inverter 428 and tri-state driver 414 provide means 
for transferring synchronized data during the high level of 40 
the DATA ENABLE signal from multiplexer 412 to pin 430. 

Latch 402 transfers the DATA signal from input D to 
output Q on the rising edge of the CLK signal. The enable 
(EN) input to latch 402 is connected to Logic 1. The output 
Q of latch 402 is connected to input D of latch 404. Inverter 45 
406 inverts CLK and supplies inverted CLK as an input to 
AND gate 408. The other input of AND gate 408 receives 
the signal labeled APHASE® in FIG. 4. The inverted CLK 
output from AND gate 408 is supplied as the enable (EN) 
input for latches 404, 410, 422, and 424. so 

Inverter 407 inverts the CLK signal and clocks latch 404. 
Latch 404 transfers the output of latch 402 to its output Q on 
the rising edge of the output from inverter 407. In a similar 
manner, latch 410 transfers the DATA signal from its D input 
to its Q output on the rising edge of the output of inverter 55 
407. The output of latches 404 and 410 are provided as data 
inputs to multiplexer 412. The phase-select signal, labeled 
APHASE SELECT® in FIG, 4 selects one of the two data 
inputs of multiplexers 412 and 426. Thus, multiplexer 412 
transfers the output of latch 404 to its output when PHASE 60 
SELECT is high and multiplexer 412 transfers the output of 
latch 410 to its output when PHASE SELECT is low. 

The output of multiplexer 412 is connected to the non- 
inverting input of tri-state driver 414. Inverter 428 inverts 
the output from multiplexer 426 and provides this as the 65 
inverted input to tri-state driver 414. Tri-state driver 414 
transfers the output of multiplexer 412 to its output when the 
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output of inverter 428 is low (Logic 0). Thus, tri-state driver 
414 transfers the output of multiplexer 412 to pin 430 when 
the output of multiplexer 426 is high. Otherwise, the tri-state 
driver 414 provides a high impedance to pin 430. 

As previously described, the synchronizing circuit com- 
posed of latches 420, 422, and 424, and multiplexer 426 
operates in the same manner as previously described for the 
DATA signal, except that the DATA ENABLE signal is 
transferred in place of the DATA signal. The Q outputs of 
latches 422 and 424 are provided as inputs to multiplexer 
426, with the PHASE SELECT signal controlling the output 
of multiplexer 426. Multiplexer 426 transfers the output of 
latch 422 to inverter 428 when PHASE SELECT is high and 
transfers the output of latch 424 to inverter 428 when 
PHASE SELECT is low. As previously discussed, tri-state 
driver 414 provide means for transferring the DATA signal 
from multiplexer 412 to pin 430 during the high level of 
DATA ENABLE signal from multiplexer 426. 

FIG. 5 is a timing diagram illustrating the operations of 
the synchronization circuits illustrated in FIGS. 3 and 4 in 
accordance with an exemplary embodiment of the present 
invention. The timing diagram shows the signals: CLOCK 
(labeled ACLK® in FIGS. 3 and 4), 2:1 CLOCK, PHASE, 
PHASE SELECT, DATA, DATA ENABLE, PIN-OUT, 
NEXT STATE, and STATE (labeled ACURRENT® in FIG. 
3). 

CLOCK is square wave in which high and low intervals 
(or pulses) are sequentially numbered. Even numbers rep- 
resent the low pulses of CLOCK and odd numbers represent 
the high pulses of CLOCK. An even and odd numbered pair 
of adjacent pulses represents a single cycle for CLOCK. The 
2:1 CLOCK time line represents a clock signal which is 
running at half the rate of CLOCK. For this example, 2:1 
CLOCK transitions to high or low when CLOCK transitions 
from low to high. The time line for PHASE depicts an 
inverse relationship to the 2:1 CLOCK time line (i.e., high 
when 2: 1 CLOCK is low and low when 2: 1 CLOCK is high). 
For the purposes of this example, PHASE SELECT is shown 
as always high. 

The DATA signal is only transferred to the output of 
multiplexer 412 when the PHASE signal is high. During 
pulses 3 and 4 (i.e., one cycle of CLOCK), the DATA signal 
goes low when PHASE is high. At the same time, during 
pulses 3 and 4 (i.e., one cycle of CLOCK), the DATA 
ENABLE signal goes high and is clocked through to tri-state 
driver 414. Thus, the DATA signal is driven through to 
PIN-OUT which goes from high to low. Subsequently, 
during pulses 5 through 12, the DATA signal goes high 
again. However, the PHASE does not go high again until 
pulses 7 and 8. During pulses 7 and 8, the high DATA signal 
is driven through latch 404 and multiplexer 412 to tri-state 
driver 414. Since DATA ENABLE signal is still held high by 
latch 422, tri-state driver 414 is still enabled. Thus, the 
DATA signal is driven through to PIN-OUT, which goes 
from low to high. Another exemplary pulse of the DATA 
signal is driven through to PIN OUT during pulses 17-20. 

In FIG. 3, the output of latch 304, labeled ACURRENT® 
in FIG. 3 and ASTATE® in FIG. 5, can only change when 
PHASE is high and CLOCK is low (i.e., pulses 4, 8, 12, 16, 
etc.). Thus, STATE transitions to State 0 during pulse 4, to 
State 1 during pulse 8, to State 2 during pulse 12, and finally 
back to State 0 during pulse 16. 

FIG. 6 is a timing diagram illustrating the operations of 
the synchronization circuits illustrated in FIGS. 3 and 4 in 
accordance with an exemplary embodiment of the present 
invention. For this example, CLOCK is 2.5 times faster than 
5:2 CLOCK, with the positive transition of 5:2 CLOCK 
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coinciding with the beginning of every fifth half-cycle of 
CLOCK. The PHASE signal«s high interval always begins 
and ends with a falling edge of CLOCK and it remains high 
for one CLOCK cycle. PHASE SELECT essentially repre- 
sents a 5: 1 CLOCK which makes its transitions on the rising 5 
edge of the 5:2 CLOCK. In other words, PHASE SELECT 
cycles at half the rate of 5:2 CLOCK and one fifth the rate 
of CLOCK. 

As in FIG. 5, the DATA signal is only transferred to the 
output of multiplexer 412 when the PHASE signal is high. 
Latches 404 and 410 are clocked and transfer data from 
input to output when PHASE is high and CLOCK is low. 
Latches 422 and 424 are clocked by the inverted CLOCK 
and transfer data from input to output when PHASE 
SELECT is high. PHASE SELECT is used to select the 
output of multiplexers 412 and 426 so that the PIN OUT 35 
signal is synchronized to the domain of the 5:2 clock signal. 

Although the present invention has been described in 
detail, those skilled in the art should understand that they can 
make various changes, substitutions and alterations herein 
without departing from the spirit and scope of the invention 20 
in its broadest form. 

What is claimed is: 

1. An interface circuit for synchronizing the transfer of 
data through an output port from a first clock domain driven 
by a first clock signal to a second clock domain driven by a 2 s 
second clock signal, the interface circuit comprising: 

a first latch having a data input for receiving a data signal 
from said first clock domain, an enable input for 
receiving said first clock signal, a clock input for 
receiving said first clock signal; and an output; 3Q 

a second latch having a data input coupled to said first 
latch output, a clock input for receiving a gating signal, 
a clock input for receiving said first clock signal, and an 
output; 

a third latch having a data input for receiving said data 
signal, an enable input for receiving a phase select 
signal, a clock input for receiving said first clock signal, 
and an output; and 

a multiplexer having a first data input coupled to said 
second latch output, a second data input coupled to said 
third latch output, and a selector input for selecting one 40 
of said first data input and said second data input for 
transfer to an output of said multiplexer. 

2. The interface circuit set forth in claim 1 wherein said 
second clock signal and said first clock signal are derived 
from a common core clock. 45 

3. The interface circuit set forth in claim 2 wherein a 
frequency of said second clock signal and a frequency of 
said first clock signal are in a ratio of N:l where N is an 
integer. 

4. The interface circuit set forth in claim 3 wherein a 50 
selection signal applied to the selector input selects said first 
data input of said multiplexer when a rising edge of said first 
clock signal is approximately in phase with a rising edge of 
said second clock signal. 

5. The interface circuit set forth in claim 2 wherein a 55 
frequency of said second clock signal and a frequency of 
said first clock signal are in a ratio of (N+2):l where N is an 
integer. 

6. The interface circuit set forth in claim 5 wherein a 
selection signal applied to the selector input selects said first 60 
data input of said multiplexer during one clock cycle of said 
second clock signal. 

7. An interface circuit for synchronizing the transfer of 
data from an output of a state machine in a first clock domain 
driven by a first clock signal to a second clock domain 65 
driven by a second clock signal, the interface circuit com- 
prising: 
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a first latch having a data input for receiving said state 
machine output, an enable input that is set to an enabled 
value, and an output; and 

a second latch having a data input coupled to said first 
latch output, an enable input for receiving a gating 
signal, a clock input for receiving said first clock signal, 
and an output coupled to an input of said state machine. 

8. The interface circuit set forth in claim 7 wherein said 
second clock signal and said first clock signal are derived 
from a common core clock. 

9. The interface circuit set forth in claim 8 wherein a 
frequency of said second clock signal and a frequency of 
said first clock signal are in a ratio of N:l where N is an 
integer. 

10. The interface circuit set forth in claim 8 wherein a 
frequency of said second clock signal and a frequency of 
said first clock signal are in a ratio of (N+2):l where N is an 
integer. 

11. A computer system comprising: 

a pipelined, x86-compatible processor having dual integer 
and dual floating point execution units, separate load/ 
store and branch units, an LI instruction cache and an 
LI data cache; 

system memory for storing data or instructions; 
a core clock; and 

an interface circuit for synchronizing the transfer of data 
through an output port from a first clock domain driven 
by a first clock signal to a second clock domain driven 
by a second clock signal, the interface circuit compris- 
ing: 

a first latch having a data input for receiving a data 
signal from said first clock domain, a clock input for 
receiving said first clock signal, an enable input that 
is set to an enabled value and an output; 

a second latch having a data input coupled to said first 
latch output, an enable input for receiving a gating 
signal, a clock input for receiving said first clock 
signal, and an output; 

a third latch having a data input for receiving said data 
signal, a enable input for receiving a phase select 
signal, a clock input for receiving said first clock 
signal, and an output; and 

a multiplexer having a first data input coupled to said 
second latch output, a second data input coupled to 
said third latch output, and a selector input for 
selecting one of said first data input and said second 
data input for transfer to an output of said multi- 
plexer. 

12. The computer system set forth in claim 11 wherein 
said second clock signal and said first clock signal are 
derived from said core clock. 

13. The computer system set forth in claim 12 wherein a 
frequency of said second clock signal and a frequency of 
said first clock signal are in a ratio of N: 1 where N is an 
integer. 

14. The computer system set forth in claim 13 wherein a 
selection signal applied to the selector input selects said first 
data input of said multiplexer when a rising edge of said first 
clock signal is approximately in phase with a rising edge of 
said second clock signal. 

15. The computer system set forth in claim 12 wherein a 
frequency of said second clock signal and a frequency of 
said first clock signal are in a ratio of (N+2):l where N is an 
integer. 

16. The computer system set forth in claim 15 wherein a 
selection signal applied to the selector input selects said first 
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data input of said multiplexer during one clock cycle of said 
second clock signal. 

17. A computer system comprising: 
a pipelined, x86-compatible processor having dual integer 
and dual floating point execution units, separate load/ 5 
store and branch units, an LI instruction cache and an 
LI data cache; 
system memory for storing data or instructions; 
a core clock; and 3Q 
an interface circuit for synchronizing the transfer of data 
from an output of a state machine in a first clock 
domain driven by a first clock signal to a second clock 
domain driven by a second clock signal, the interface 
circuit comprising: 1S 
a first latch having a data input for receiving said state 
machine output, a clock input for receiving said first 
clock signal, an enable input set to an enabled value, 
and an output; and 
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a second latch having a data input coupled to said first 
latch output, an enable input for receiving a gating 
signal, a clock input for receiving said first clock 
signal, and an output coupled to an input of said state 
machine. 

18. The computer system set forth in claim 17 wherein 
said second clock signal and said first clock signal are 
derived from a common core clock. 

19. The computer system set forth in claim 18 wherein a 
frequency of said second clock signal and a frequency of 
said first clock signal are in a ratio of N:l where N is an 
integer. 

20. The computer system set forth in claim 18 wherein a 
frequency of said second clock signal and a frequency of 
said first clock signal are in a ratio of (N+2):l where N is an 
integer. 
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ABSTRACT 



An arithmetic engine includes a first dual multiplier accu- 
mulator (MAC) for receiving input data and for producing 
first dual MAC output data. A second dual MAC is coupled 
in parallel to the first dual MAC. The second dual MAC 
receives the input data and produces second dual MAC 
output data. An adder array is coupled to both the first dual 
MAC and to the second dual MAC. The adder array receives 
the input data, the first dual MAC output data, and the 
second dual MAC output data and produces arithmetic 
engine output data. Each dual MAC comprises a multiplier 
cross point switch, multiplier registers, a register selector, 
and parallel multipliers. Each adder array comprises a cross 
point switch, adder registers, a register selector, adder, and 
condition code determiner. 
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ARITHMETIC ENGINE WITH DUAL 
MULTIPLIER ACCUMULATOR DEVICES 

This application is a continuation of prioT application 
Ser. No. 08/170,145, filed Dec. 20, 1993. 5 

FIELD OF THE INVENTION 

This invention relates in general to digital signal process- 
ing and in particular to arithmetic engines for such applica- 10 
tions. 

BACKGROUND OF THE INVENTION 

High speed digital signal processing (DSP) in communi- 15 
cation system applications require low power implementa- 
tions of high performance DSP capabilities, especially for 
hand held devices. These DSPs must be capable of effi- 
ciently performing the spectral analysis and digital filtering 
required for hand-held spread spectrum communication 20 
equipment, high speed modems and all digital radios. These 
applications require matrix math operations, complex arith- 
metic operations, Fast Fourier Transform (FFT) calcula- 
tions, encoding/decoding, searching and sorting. Optimiza- 
tion requires that the DSP arithmetic engine use many 
resources efficiently. For example, for execution of the 
complex multiply operation: 



25 



(jcl+ryl)U2+(y2)=(xl^-yly2>fi(xly2+yljc2) 



30 



four real multiplications and two real additions are required 
to perform this operation in one instruction. More multipli- 
ers and adders would not improve the speed, less would 
require more instructions. 

What is needed is a high performance arithmetic engine 35 
that can be optimized to perform higher radix FFTs, complex 
multiplication and high speed data sorting. Higher radix 
FFTs require less memory accesses because intermediate 
results are stored in registers in the arithmetic engine. The 
throughput is therefore faster because the arithmetic engine 40 
can be pipelined. A radix 8 4096 point FFT requires 2 h less 
data fetches than a radix 2 FFT and less than a radix 4. The 
complex multiply is the basis for most DSP algorithms and 
is therefore an important performance target. Sorting is 
required to do statistical filtering and interpret results. Sort- 45 
ing can determine the location of peaks, depressions and 
statistically variant data which are typical signal analysis 
objectives. Such an arithmetic engine requires many 
resources which must be used efficiently because of size and 
power limitations. 50 

Typical arithmetic engines for general purpose DSPs 
include one multiplier and one adder referred to as a 
multiplier-accumulator (MAC). Arithmetic engines of DSPs 
that are more application specific can contain an array of 
several multipliers and adders. The arithmetic engines of the 55 
latter DSPs are typically optimized to perform a Radix-4 
butterfly and cannot do high speed sorting. 

Typical arithmetic engines are also not capable of effi- 
ciently handling the arithmetic computations required for all 
digital communications applications. General purpose DSPs 60 
are capable of performing Radix 2 butterflies efficiently but 
can not do higher radix butterflies because they do not have 
sufficient resources. The FFT requires many complex mul- 
tiplications, and each complex multiply requires four real 
multiplications and two real additions. Since general pur- 65 
pose DSPs have only one MAC, it takes four passes through 
the arithmetic engine of such a DSP to perform just one 



complex multiply. Engines which have more than one MAC 
usually limit the resources to Radix 4 butterfly operations. 

Topical arithmetic engines are also not capable of han- 
dling high speed data sorting. A typical data sort requires 
comparison of two pieces of data which results in a condi- 
tion code. The condition code is passed to the execution unit 
which then causes the program to branch in one of two paths 
on a subsequent instruction. The next instruction after the 
branch moves the data to its new location based on the 
results of the data comparison. This method requires a 
minimum of three instructions and can only operate on one 
data pair at a time. Typical arithmetic engines do not have 
sufficient data path switching to perform hardware sorts on 
multiple data pairs. 

SUMMARY OF THE INVENTION 

Accordingly, it is an advantage of the present invention to 
provide a new and improved arithmetic engine. It is an 
additional advantage that the arithmetic engine operates on 
complex data in ordered pairs and is optimized for Radix 8 
processing. It is still a further advantage of the invention that 
the arithmetic engine provides high speed data sorting, using 
full computing resources by recirculating data within the 
engine by feeding outputs back to inputs. 

To achieve these advantages, an arithmetic engine is 
contemplated which includes a first dual multiplier accumu- 
lator (MAC) for receiving input data at a first dual MAC first 
input and for producing first dual MAC output data at a first 
dual MAC output. A second dual MAC is coupled in parallel 
to the first dual MAC. The second dual MAC receives the 
input data at a second dual MAC first input and produces 
second dual MAC output data at a second dual MAC output. 
An adder array is coupled to both the first dual MAC and to 
the second dual MAC. The adder array is for receiving the 
input data, the first dual MAC output data, and the second 
dual MAC output data and for producing arithmetic engine 
output data at an adder array output. 

The above and other features and advantages of the 
present invention will be better understood from the follow- 
ing detailed description taken in conjunction with the 
accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

In FIG, 1, there is shown a schematic of an arithmetic 
engine in a complex arithmetic processor (CAP) or a DSP in 
accordance with a preferred embodiment of the invention; 

In FIG. 2, them is shown a schematic of a dual multiplier 
accumulator (MAC) suitable for use in the arithmetic engine 
shown in FIG. 1; 

In FIG. 3, there is shown a first pair of adders suitable for 
the adder array shown in FIG, 1; 

In FIG. 4, there is shown a second pair of adders suitable 
for the adder array shown in FIG. 1; and 

In FIG. 5, there is shown a third pair of adders suitable for 
the adder array shown in FIG. 1. 

DETAILED DESCRIPTION OF THE DRAWINGS 

Generally, the present invention provides an arithmetic 
engine apparatus optimized for complex arithmetic process- 
ing. While the arithmetic engine described is well suited for 
DSP applications, the use of such an arithmetic engine is not 
limited to DSP or complex arithmetic processing. The arith- 
metic invention may be accomplished by having two dual 
MACs in parallel, each connected to an adder array. Each 
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dual MAC can perform two multiplies and two addition 
operations simultaneously and can also sort its input data. 
The adder array can perform six addition operations simul- 
taneously and is also capable of performing sorting and 
storing of peak values of input data. 5 

The present invention can be more fully understood with 
reference to the figures. FIG. 1 illustrates a arithmetic engine 
11 within a complex arithmetic processor (CAP) or digital 
signal processor PSP) 10. The arithmetic engine 11 
includes a dual multiplier accumulator (dual MAC) 12, a I0 
dual MAC 13, and an adder array 14. The adder array 14 is 
coupled to the parallel combination of dual MACs 12 and 
13. First dual MAC first input 16 and second dual MAC first 
input 17 are input lines that provide data to the dual MACs 
from the input data streams. First dual MAC second input 22 
and second dual MAC second input 23 are input lines that 15 
can provide data to the respective dual MACs either from the 
input data streams, or from the adder array via first adder 
pair output data 74. First adder pair output data 74 is a 
sub-set of the output of the adder array 20. First dual MAC 
outputs 21 and second dual MAC outputs 19 are outputs of 20 
dual MAC 12 and dual MAC 13, respectively. These dual 
MAC outputs are inputs to the adder array 14 and are 
elements of the output data streams. The adder array 14 can 
also receive data directly from the input data streams. 

FIG. 2 illustrates a schematic of a dual MAC suitable for 
use in the arithmetic engine in FIG. 1. The dual MAC in 
FIG. 2 describes specifically the dual MAC 12 in FIG. 1 by 
showing the inputs as first dual MAC first input 16, first dual 
MAC second input 22, and the output as' first dual MAC 3Q 
outputs 21. FIG. 2. could represent dual MAC 13 in FIG. 1 
if first dual MAC first input 16 and first dual MAC second 
input 22 were replaced by second dual MAC first input 17 
and second dual MAC second input 23, respectively. The 
output consisting of first dual MAC outputs 21 would, in 35 
such a case, be replaced by second dual MAC outputs 19. 

The dual MAC in FIG. 2 includes a dual multiplier 30 and 
a dual accumulator 34, The dual multiplier 30 comprises a 
multiplier cross point switch 42, multiplier registers 46, 
register selector SO, first parallel multiplier 52 and second 40 
parallel multiplier 54. The cross point switch 42 allows any 
combination of inputs to be loaded into the multiplier 
registers 46 simultaneously. These registers are coupled to 
the register selector 50 which directs the appropriate mul- 
tiplier register values 48 to both the first parallel multiplier 45 
52, and second parallel multiplier 54. Ibe outputs of the 
multipliers are coupled to both the MAC output selector 66 
and the adder input registers 57 of the dual accumulator 34. 

Dual accumulator 34 comprises accumulator registers 56, 
adder input registers 57, an accumulator crosspoint switch 50 
62, first MAC adder 59, second MAC adder 61, condition 
code 63, and MAC output selector 66. The accumulator 
registers 56 are used to store the sum of previous MAC 
adder outputs which are first accumulator output 58 and 
second accumulator output 60 in FIG. 2. The adder input 55 
registers 57 are used to store the outputs of the first and 
second parallel multiplier outputs 36 and 38 of the dual 
multiplier 30. Both the accumulator registers 56 and the 
adder input registers 57 are coupled to the accumulator 
crosspoint switch 62. The adder input registers 57 are also 60 
coupled to the MAC output selector 66. The accumulator 
crosspoint switch 62 directs combinations of its inputs to 
both the first MAC adder 59 and the second MAC adder 61. 
If the dual accumulator 34 is in accumulate mode, the 
outputs of the first and second MAC adders 59 and 61 are 65 
feedback to the accumulator registers 56 to store their 
outputs. The first and second MAC adders 59 and 61 are also 



4 

coupled to the MAC output selector 66. The second MAC 
adder 61, which can also be used as a subtractor, is also 
coupled to the condition code 63. The condition code 63 
determines which of the second MAC adder 61 inputs is 
greater and uses this information to control the MAC output 
selector 66 which is coupled to its output. The MAC output 
selector 66 receives data from the first and second parallel 
multiplier outputs 36 and 38, the adder input register values 
51, and from the first and second accumulator output 58 and 
60. The current mode of the dual MAC determines which of 
these outputs to send out on the first dual MAC outputs 21 
in FIG. 2. If the dual MAC is in the sort mode, the condition 
code output 64 is used to determine the appropriate order for 
the output of the adder input register values 51 on the first 
dual MAC outputs 21. 

To explain the workings of the FIG. 2 representation in 
further detail, data can enter the multiplier cross point switch 
42 from the input data streams or from the first adder pair 
output data 74. The input data streams consist of data pairs 
that come from the complex arithmetic processor memories. 
The first adder pair output data 74 comes from the first rank 
of adders inside the adder array 14. The input data streams 
and the first adder pair output data enter the multiplier cross 
point switch 42 via the first dual MAC first input 16, and the 
first dual MAC second input 22, respectively. The multiplier 
cross point switch 42 routes the input data to the multiplier 
registers,46 as selected sources 44. The cross point switch 42 
is capable of moving multiple input values to multiple 
multiplier registers simultaneously. The register selector 50 
receives the multiplier register values 48 from the multiplier 
registers 46 and sends the appropriate values to the first and 
second parallel multipliers 52 and 54 as selected multiplier 
operands 53. The first and second parallel multipliers 52 and 
54 multiply their respective inputs and output their results on 
first parallel multiplier output 36 and second parallel mul- 
tiplier output 38. If the dual MAC is in bypass mode, these 
outputs are sent directly to the MAC output selector 66 
where they are sent out of the dual MAC via first dual MAC 
outputs 21. If the dual MAC is in arithmetic mode, the first 
multiplier output 36 and the second multiplier output 38 are 
sent to the dual accumulator 34 and stored in the adder input 
registers 57. 

After data is stored in the input registers 57 in FIG. 2, 
several things can happen. If the dual MAC is in arithmetic 
mode, the accumulator crosspoint switch 62 takes the two 
adder input registers values 51 and sends them both to the 
first MAC adder 59 and the second MAC adder 61 via the 
selected adder inputs 65. In such a case, the second MAC 
adder 61 is being used as a subtractor, resulting in the sum 
and difference of the two values. The sum and difference 
appear on the first accumulator output 58 and the second 
accumulator output 60, respectively, and enter the MAC 
output selector 66, before exiting the dual MAC via the first 
dual MAC outputs 21. 

When the dual MAC in FIG. 2 is in the sort mode, the sum 
and difference of the adder input registers 57 are calculated 
as described above. The condition code 63 inputs the result 
from the second MAC adder 61 via the condition code input 
67 (still being used as a subtractor), and determines which 
of the two adder input register values 51 was greater. A 
decision is sent to the MAC output selector 66 via the 
condition code output 64. The decision is then used by the 
MAC output selector to determine how to arrange the adder 
input register values 51 on the first dual MAC output 21. The 
largest and smallest values will always appear in the same 
location on the first dual MAC outputs 21. 

When the dual MAC in FIG. 2 is in the accumulator 
mode, the accumulator crosspoint switch 62 uses the adder 
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input register values 51 and the accumulator register values 
50 as inputs. A top pair is sent to the first MAC adder 59 and 
a bottom pair is sent to the second MAC adder 61. In the 
accumulate mode, the second MAC adder 61 is used as an 
adder and not a subtracter The respective outputs of each of 
the first and the second MAC adders 59 and 61 are feedback 
to the accumulator registers 56 and added to the next values 
that are stored in the adder input registers 57. Final accu- 
mulator values are output from the first and the second MAC 
adders 59 and 61, sent to the MAC output selector 66, and 
eventually out of the dual MAC via the first dual MAC 
outputs 21. 

FIG. 3 illustrates a first pair of adders suitable for the 
adder array 14 shown in FIG. 1. The first rank of adders in 
FIG. 3 comprises a pair of cross point switches 84, register 
files 86, register selectors 88, adders 92 and condition code 
determiners 94. The cross point switches 84 accept input 
data streams to the arithmetic engine, outputs from the first 
dual MAC 21, and outputs from the second dual MAC 19. 
In addition to accepting external data, the cross point 
switches accept arithmetic engine output data. Outputs from 
the first, the second, and the third adder pairs, 74, 174, and 
274, respectively, are feedback to the first adder cross point 
switches 84. The selected values 85 from the cross point 
switch 84 are stored in the first adder registers 86. There 
must be at least eight registers in front of each adder 92 to 
efficiently perform a radix 8 butterfly algorithm. These 
registers buffer output data until it can be used by the adder. 
The need for buffering occurs because the input sequence of 
data does not match the sequence in which the data is 
consumed. 

First adder operand A 90 and first operand B 91 in FIG. 
3 are selected by the first adder register selector 88 from first 
adder registers 86 for use by the first adders 92. Adders 92 
can perform add or subtract operation. The first adder 
selectors 88 are also responsible for performing a sort 
operation as follows. First adder operands 90 and 91 are 
compared by first adder 92. The sign of the result indicates 
which operand was larger. The sign of the result is deter- 
mined by the condition code determiner 94 which feeds back 
the sort register select signal 97 to the first adder register 
selector 88, This signal is used to select which operand 90 
or 91 is output as the first operand sort 87. In addition, the 
signal is used to select which value is output as first register 
sort 89 from a register pair which are not compared. These 
connections allow a sort chain to be established which can, 
in the preferred embodiment described, locate the five 
highest values (peak search) in one pass of the data through 
the arithmetic engine. The value of the peaks are on first 
operands sort 87 while the locations of the peaks are on first 
registers sort 89. 

FIG. 4 illustrates a second pair of adders suitable for the 
adder array 14 shown in FIG. 1. The second pair of adders 
in FIG. 4 comprises a pair of cross point switches 184, 
register files 186, register selectors 188, adders 192 and 
condition code determiners 194. The cross point switches 

184 accept input data streams to the arithmetic engine, 
outputs from the first dual MAC 21, and outputs from the 
second dual MAC 19. In addition to accepting external data, 
the cross point switches accept arithmetic engine output 
data. Outputs from the first, the second, and the third adder 
pairs, 74, 174, and 274, respectively, are feedback to the 
second adder cross point switches 184. The selected values 

185 from the cross point switch 184 are stored in the second 
adder registers 186. Second adder operand A 190 and second 
operand B 191 are selected by the second adder register 
selector 188 from second adder registers 186 for use by the 
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second adders 192. Adders 192 can perform add or subtract 
operation. The second adder selectors 188 are also respon- 
sible for performing the sort operation. Second adder oper- 
ands 190 and 191 are compared by second adder 192. The 
sign of the result indicates which operand was larger. The 
sign of the result is determined by the condition code 
determiner 194 which feeds back the sort register select 
signal 197 to the second adder register selector 188. This 
signal is used to select which operand 190 or 191 is output 
as the second operand sort 187. In addition, the signal is used 
to select which value is output as second register sort 189 
from a register pair which are not compared. These connec- 
tions allow a sort chain to be established which can, in the 
preferred embodiment described, locate the five highest 
values (peak search) in one pass of the data through the 
arithmetic engine. The value of the peaks are on second 
operands sort 187 while the locations of the peaks are on 
second registers sort 189. 

FIG. 5 illustrates a third pair of adders suitable for the 
adder array 14 shown in FIG. 1. The third pair of adders in 
FIG. 5 comprises a pair of cross point switches 284, register 
files 286, register selectors 288, adders 292 and condition 
code determiners 294. The cross point switches 284 accept 
input data streams to the arithmetic engine, outputs from the 
first dual MAC 21, and outputs from the second dual MAC 
19. In addition to accepting external data, the cross point 
switches accept arithmetic engine output data. Outputs from 
the first, the second, and the third adder pairs, 74, 174, and 
274, respectively, are feedback to the first adder cross point 
switches 284. The selected values 285 from the cross point 
switch 284 are stored in the third adder registers 286. Third 
adder operand A 290 and third operand B 291 are selected 
by the third adder register selector 288 from third adder 
registers 286 for use by the third adders 292. Adders 292 can 
perform add or subtract operation. The third adder selectors 

288 are also responsible for performing the sort operation. 
Third adder operands 290 and 291 are compared by third 
adder 292. The sign of the result indicates which operand 
was larger. The sign of the result is determined by the 
condition code determiner 294 which feeds back the sort 
register select signal 297 to the third adder register selector 
288. This signal is used to select which operand 290 or 291 
is output as the third operand sort 287. In addition, the signal 
is used to select which value is output as third register sort 

289 from a register pair which are not compared. These 
connections allow a sort chain to be established which can, 
in the preferred embodiment described, locate the five 
highest values (peak search) in one pass of the data through 
the arithmetic engine. The value of the peaks are on third 
operands sort 287 while the locations of the peaks are on 
third registers sort 289. 

As an example illustrating how the dual MACs 12 and 13 
in FIG. 1 work with the adder array 14, a description of the 
radix 8 butterfly will be given. The radix 8 butterfly function 
acts on eight complex numbers producing eight complex 
outputs. Data passes through the FIG. 2 dual multipliers 30, 
dual accumulator 34, and each rank of the adder arrays 95, 
195, 295 in FIG. 3 using all of the resources of the arithmetic 
engine each instruction. During each instruction, a complex 
multiply is performed in dual MACs 12 and 13 and six add 
operations occur in the adder array 14 (add or subtract). 
Twelve instructions are required to fill the arithmetic engine 
11 registers with data and produce the first output. After 
arithmetic engine pipeline is full, output data is produced 
every instruction, as can be seen in the detailed description 
below. 

This detailed discussion of how data is sequenced through 
the arithmetic engine during a radix 8 butterfly refers to 
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Table 1, Radix 8 Pipeline Sequence. Table 1 shows how data 
is transferred from register to register for each instruction. 
Data enters the arithmetic engine at the dual MAC cross 
point switch 42 in FIG. 2 and is stored in the multiplier input 
registers 46. There are four registers assigned to each 
multiplier. Registers M11-M14 are assigned to first parallel 
multiplier 52. Registers M21-M24 are assigned to second 
parallel multiplier 54 of the first dual MAC 12. Registers 
M31-M34 and M41-M44 are assigned to the multipliers of 
the second dual MAC 13. Input data consists of complex 
data samples X7, Y7 and complex coefficients Zx7, Zy7. 
Instruction 1 loads eight registers simultaneously through 
the cross point switch (Mil, M41, M21, M31, M13, M33, 
M23, M43). 



10 



8 



Instruction 2 loads the next complex data pair (X6.Y6), 
(Zx3,Zy3) into the multiplier registers 46 and multiplies data 
from the previous fetch and stores it into the adder input 
registers 57, which are designated All, A12 for the first dual 
Mac 12 and A21, A22 for the second dual MAC 13. 

Instruction 3 fetches the third complex pair (X5,Y5) 
(Zx5,Zy5) and completes the complex multiply of the first 
data sample by adding and subtracting the previous adder 
input registers 57 (All, A12, A21, A22). The results are 
stored in the first adder registers 86 of the adder array 14 
(A38.A48) (in FIG. 3). 



TABLE 1 



RADIX 8, 8 CYCLE BUTTERFLY PIPEFILL SEQUENCE 



Output of Output of 1st Output of 2nd Output of 3rd 

Input to MAC Multiplier Output of MAC Adder Pair Adder Pair Adder Pair 



Instruction #: 1 

X7-»M11,M41 
Y7-»M21,M31 
Zx7->M13,M33 
Zy7-*M23,M43 
Instruction #: 2 



X6-»M11,M41 
Y6-»M21,M31 
Zx3->M13,M33 
Zy3-*M23,M43 
Instruction #: 3 



M11*M13-»A11 
M21*M23->A12 
M31*M33->A21 
M41*M43-»A22 



X5->M11,M41 
Y5-»M21,M31 
Zx5-»M13,M33 
Zy5->M23,M43 
Instruction #: 4 



M11*M13->A11 
M21»M23->A12 
M31*M33->A21 
M41*M43-»A22 



A11+A12->A38 
A21-A22->A48 



X4-»M11,M41 
Y4->M21,M31 
Zxl->M13,M33 
Zyl-*M23,M43 
Instruction #: S 



M11*M13-*A11 
M21*M23-*A12 
M31*M33-»A21 
M41*M43-»A22 



A11+A12-»A34 
A21-A22->A44 



X3-»Mll,M41 
Y3->M21,M31 
Zx6-»M13JV133 
Zy6->M23.M43 
Instruction #: 6 



M11*M13-*AI1 
M21*M23-»AI2 
M31*M33-»A21 
M41*M43-»A22 



A11+A12->A37 
A21-A22->A47 



A34-A38->M42 
A44-A48-»M22 



X2-»M11,M41 
Y2-»M21,M31 
Zx2-*M13,M33 
Zy2-*M23,M43 
Instruction #: 7 



M11*M13~*A11 
M21*M23-»A12 
M31*M33-»A21 
M41*M43-»A22 



A11+A12-»A33 
A21-A22-»A43 



A34+A38-*A57 
A44+A48-*A67 



X1-»M11,M41 
Y1-»M21,M31 
Zx4-*M13,M33 
Zy4->M23,M43 
Instruction 8 



M11*M13-»A)1 
M21*M23-»A12 
M31*M33->A21 
M41*M43->A22 



A11+A12-»A36 
A21-A22-»A46 



A33-A37->M12 
A43-A47-»M32 



X0->A31 M11*M13-»A11 A11+A12-»A32 A33+A37->A53 

Y0->A41 M21*M23-»A12 A21-A22-»A42 A43+A47-*A63 

M31*M33-»A21 

M41*M43-»A22 

Instruction #: 9 



X15->M11,M41 
Y15-»M21,M31 
Zx7->M13,M33 
Zy7->M23,M43 



MI4*M12->AU 
M24*M22->A12 
M34*M32-»A21 
M44*M42->A22 



A11+A12-»A35 
A21-A22-»A45 



A32+A36-»A55 
A42+A46-»A65 



A53+A57-)A75 
A63+A67-»A85 
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TABLE 1 -continued 



RADIX 8, 8 CYCLE BUTTERFLY PIPEFILL SEQUENCE 

Output of Output of 1st Output of 2nd Output of 3rd 

Input to MAC Multiplier Output of MAC Adder Pair Adder Pair Adder Pair 



Instruction #: 10 



X14-»M11,M41 
Y14->M21,M31 
Zx3-»M13 t M33 
Zy3->M23,M43 
Instruction #: 1 1 

X13-»MU,M41 
Y13-»M21,M31 
Zx5->M13,M33 
Zy5->M23,M43 
Instruction #: 12 



M11*M13-»A11 
M21*M23->A12 
M31*M33-»A21 
M41*M43-»A22 



Mll*M13->All 
M21*M23-»A12 
M31*M33-»A21 
M41*M43-»A22 



A11+A12-*A54 
A11-A12->A64 
A21+A22->A68 
A21-A22-»A58 



AI1+A12-*A38 
A21-A22->A48 



A31+A35-*A51 
A41+A45-*A61 



A32-A36->A66 
A42-A46->A56 



A53-A57->A87 
A63-A67->A77 



A51+A55->A71 
A6I+A65-»A81 



X12-»Mll,M4l 
Y12-»M21,M3l 
Zxl->M13,M33 
Zyl->M23,M43 
Instniciion #: 13 

X11-»M11,M4I 
Y11-»M21,M31 
Zy6->M13,M33 
Zy6-»M23,M43 
Instruction #: 14 



M11*M13->A11 
M21*M23->A12 
M31+M33->A21 
M41+M43->A22 



M11*M13-»A11 
M21*M23->A12 
M31*M33-»A21 
M41*M43-»A22 



AU+A12-»A34 
A21-A22-*A44 



A11+A12-M37 
A21-A22->A47 



A31-A35-»A52 
A41-A45->A62 



A34-A38->M42 
A44-A48->M22 



A51-A55->A73 
A61-A65->A83 



A54-A58->A86 
A64-A68-»A78 



A71+A75-»X0 
A81+A85-»Y0 



X10-»M11,M41 
Y10->M21,M3I 
Zx2->M13,M33 
Zy2->M23,M43 
Instruction #: 15 



MH*M13->AU 
M2l*M23-»A12 
M31*M33-»A21 
M41*M43->A22 



A11+AI2-4A33 
A21-A22-»A43 



A34+A38->A57 
A44+A48->A67 



A54+A58-»A76 
A64+A68->A88 



A71-A75->X4 
A81-A85-»Y4 



X9->Mll,M41 
Y9-»M21,M31 
Zx4->M13,M33 
Zy4-»M23 ( M43 
Instruction U: 16 



M11*M13->A11 
M21*M23->A12 
M31+M33->A2l 
M4I*M43-»A22 



A11+A12->A36 
A21-A22->A46 



A33-A37-»M12 
A43-A47->M32 



A52+A56-»A72 
A62+A66-M84 



A73+A77-»X2 
A83-A87->Y2 



X8->A31 
Y8->A41 



Instruction #: 17 



M11*M13->A11 
M21*M23-»A12 
M3l*M33-»A21 
M41*M43-»A22 



A11+A12-»A32 
A21-A22->A42 



A33+A37-»A53 
A43+A47->A63 



A52-A56->A74 
A62-A66-»A82 



A73-A77-»X6 
A83+A87-»Y6 



X23->M11,M41 
Y23-»M21,M31 
Zx7-*M13,M33 
Zy7->M23,M43 
Instruction #: 18 



M14«M12-»A11 
M24*M22-»A12 
M34*M32-»A21 
M44*M42->A22 



A11+A12-»A35 
A21-A22->A45 



A32+A36-*A55 
A42+A46-»A65 



A53+A57-»A75 
A63+A67->A85 



A72+A76->X1 
A82-A86->Y1 



X22-»M11,M41 
Y22-*M21,M31 
Zx3->M13,M33 
Zy3->M23.M43 
Instruction #: 19 



M14*M12-»A11 
M24*M22->A12 
M34*M32-*A21 
M44*M42->A22 



A11+A12->A54 
A1I-A12-»A64 
A21+A22-»A68 
A21-A22->A58 



A31+A35-»A51 
A41+A45->A61 



A53-A57-*A87 
A63-A67->A77 



A72-A76->X5 
A82+A86->Y5 



X21-»MU,M4I 
Y21->M21,M31 
Zx5->M13,M33 
Zy5->M23,M43 
Instruction S: 20 



M14*MI2->AU 
M24*M22->A12 
M34*M32-»A2I 
M44*M42->A22 



A11+A12-»A38 
A21-A22-»A48 



A32-A36-»A66 
A42-A46->A56 



A5I +A55-»A71 
A61 +A65-»A81 



A74-A78->X3 
A84-A88-»Y3 



X20-»M11,M41 
Y20->M2l,M31 
Zxl->M13,M33 
Zyl->M23,M43 
Instruction #: 21 



M14*M12-»All 
M24*M22->A12 
M34*M32->A21 
M44*M42-»A22 



A11+A12-»A34 
A21-A22-4A44 



A31-A35-»A52 
A41-A45-»A62 



A51-A55-»A73 
A61-A65-»A83 



A74+A78->X7 
A84+A88-»Y7 



X19->M11,M41 
Y19-»M21,M31 
Zx6-»M13,M33 
Zy6-»M23,M43 



M14*M12->A11 
M24*M22-»A12 
M34*M32->A21 
M44*M42->A22 



AU-i-A12->A37 
A21-A22->A47 



A34-A38-»M42 
A44-A48-»M22 



A54-A58-»A86 
A64-A68-»A78 



A7I+A75->X8 
A81+A85-»Y8 
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RADIX 8, 8 CYCLE BUTTERFLY PIPEFHX SEQUENCE 



Output of Output of 1st Output of 2nd Output of 3rd 

Input to MAC Multiplier Output of MAC Adder Pair Adder Pair Adder Pair 



Instruction #: 22 












X18-»M11,M41 
Y18->M21,M31 
Zx2-*M13,M33 
Zy2-»M23,M43 
Instruction #: 23 


M14*M12->A11 
M24*M22->A12 
M34*M32-*A21 
M44*M42->A22 


A11+AI2->A33 
A21-A22->A43 


A34+A38-»A57 
A44+A48->A67 


A54fA58-»A76 
A64+A68->A88 


A71-A75-OC12 
A81-A85-»Y12 


XI7-*M11,M41 
Y17->M21,M31 
Zx4->M13,M33 
Zy4->M23,M43 


MI4+M12-»AU 
M24*M22->A12 
M34*M32->A21 
M44*M42-»A22 


All+A12-w\36 
A21-A22-»A46 


A33-A37-+M12 
A43-A47-»M32 


A52+A56->A72 
A62+A66-»A84 


A73+A77-»X10 
A83-A87->Y10 



Instruction 4 fetches the fourth complex pair (X4,Y4) 
(Zxl,Zyl), completes the complex multiply of the second 20 
sample by adding and subtracting the previous adder input 
registers 57, and stores the results in the first adder registers 
86 of the adder array 14 (A34, A44). 

Instruction 5 fetches the fifth complex pair (X3,Y3) 
(Zx6,Zy6), completes the complex multiply of the third 25 
sample, and stores the results in the first adder registers 86 
(A37, A47). The previous values loaded into the first adder 
registers in instructions 3 and 4 (A34, A38, A44, A48) are 
subtracted and stored back in the dual MAC multiplier 
registers 46 (M42, M22) to be used later in instruction 9. 30 
This instruction demonstrates the feedback properties of the 
arithmetic engine. The feedback is applied to maintain 
complete utilization of the arithmetic engine resources while 
calculating the radix 8 butterfly. 

Instruction 6 fetches the sixth complex pair (X2,Y2) 35 
(Zx2,Zy2), completes the complex multiply of the fourth 
sample, and stores the results in the first adder registers 86 
(A33, A43). The values loaded into the first adder registers 
in instructions 3 and 4 (A34, A38, A44, A48) are added and 
stored in the second adder registers 186 of the second rank 40 
adder block (FIG. 4 A57 and A67). 

Instruction 7 fetches the seventh complex pair (X1,Y1) 
(Zx4,Zy4), completes the complex multiply of the fifth 
sample, and stores the results in the first adder registers 86 
(A36, A46). The previous values loaded into the first adder 45 
registers in instructions 5 and 6 (A33, A37, A43, A47) are 
subtracted and stored back in the dual MAC multiplier 
registers 46 (M12, M32) to be used later in instruction 9. 
This instruction again demonstrates the feedback properties 
of the arithmetic engine. 

Instruction 8 fetches the pair (X0,Y0) and stores them 
directly into A31 and A41 of the first adder registers 86. The 
complex multiply of the sixth sample is completed, and the 
values loaded into the first adder registers in instruction 5 
and 6 (A33, A37, A43, A47) are added and loaded into the 55 
second adder registers 186 (A53, A63). 

Instruction 9 fetches the ninth complex pair (X15,Y15) 
(Zx7,Zy7), completes the complex multiply of the seventh 
instruction, adds the previous values of the first adder 
registers in instruction 7 and 8 (A32, A36, A42, A46) and go 
loads them into the second adder registers 186 (A55, A65). 
The previous values loaded in the second adder registers in 
instruction 6 and 8 (A53, A57, A63, A67) are added and 
stored in the third adder registers 286 (FIG. 5 A7S and A85). 
This same process continues through instruction 12. g 5 

After instruction 12, the pipeline is full and the final 
results of the radix eight butterfly are available beginning 



with instruction 13. Instruction 13 operates like instruction 
9 through 12 with the exception of a final operation. Instruc- 
tion 13 takes the previous values loaded into the third adder 
registers 286 in instructions 9 and 1 1 (A71, A75, A81, A85), 
adds them together, and sends the result (X0, Y0) to 
memory. This result is the first output value of the radix 8 
butterfly. Instruction 14 outputs the second result (X4, Y4) 
and the process continues for a single butterfly through 
instruction 20 where the last value is output (X7, Y7). 

In summary, the Fast Fourier Transform is a fundamental 
signal processing application. The speed at which this trans- 
form can be computed often determines which digital signal 
processor is chosen for a given application. Higher radix 
butterflies decrease the execution time of the FFT by 
increasing throughput, provided there are sufficient 
resources to store intermediary values. The arithmetic 
engine contemplated here, while optimized for a radix 8 
butterfly, can also perform radix 2 and radix 4 butterflies 
efficiently. 

The significance of this arithmetic engine described lies in 
its architecture. The architecture is designed to input and 
operate on complex data in ordered pairs. Each adder pair is 
designed to operate on a complex number. There are three 
ranks of adder pairs because of the Radix 8 butterfly 
requirement. The output of the first adder pair 74 provides 
radix 2 results, the output of the second adder pair 174 
provides radix 4 results, and the output of the third adder pair 
274 provides radix 8 results. 

Another important capability designed into the hardware 
is high speed data sorting. Most arithmetic engines simply 
pass data through from the input to the output. The engine 
described here differs because it recirculates data within the 
engine by feeding the outputs back to the inputs. This feature 
allows full use of computing resources and is essential for 
parallel sorting operations where all of the adders and 
switches, are active. Use of a cross point switch allows any 
combination of inputs to be loaded simultaneously into any 
combination of registers. Typically, simple multiplexers are 
used which only allow one or two registers to be loaded at 
a time. 

Urns, there has also been provided, in accordance with an 
embodiment of the invention, an arithmetic engine that fully 
satisfies the aims and advantages set forth above. While the 
invention has been described in conjunction with a specific 
embodiment, many alternatives, modifications, and varia- 
tions will be apparent to those of ordinary skill in the art in 
light of the foregoing description. Accordingly, the invention 
is intended to embrace all such alternatives, modifications, 
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and variations as fall within the spirit and broad scope of the 
appended claims. 
What is claimed is: 
1. An arithmetic engine comprising: 
a first dual multiplier accumulator (MAC) for receiving 5 
input data at a first dual MAC first input and for 
producing first dual MAC output data at a first dual 
MAC output, wherein the first dual MAC comprises: 
a dual multiplier for receiving a plurality of sources and 
for producing first and second parallel multiplier 10 
outputs, the dual multiplier including a multiplier 
cross point switch for receiving the plurality of 
sources and for producing selected sources at a 
plurality of multiplier cross point switch outputs, 
wherein the dual multiplier comprises: J5 
a plurality of multiplier registers coupled to the 
multiplier cross point switch outputs, the plurality 
of multiplier registers for storing the selected 
sources as multiplier register values; 
a register selector coupled to the plurality of multi- 2 o 
plier registers, the register selector for choosing 
selected multiplier operands from the multiplier 
register values; and 
first and second parallel multipliers coupled to the 
register selector, the first and the second parallel 25 
multipliers for receiving pairs of selected multi- 
plier operands and multiplying the pairs of 
selected multiplier operands to produce the first 
and the second parallel multiplier outputs, respec- 
tively; 30 
an accumulator coupled to the dual multiplier, the 
accumulator for receiving the first and second par- 
allel multiplier outputs, selecting particular first and 
second parallel multiplier outputs in a condition code 
determiner, and for producing the first dual MAC 35 
output data therefrom, the accumulator comprising: 
a plurality of adder registers for storing the first and 
the second parallel multiplier outputs and a plu- 
rality of accumulator registers for storing a first 
and a second accumulator output; 40 
an accumulator cross point switch coupled to the 
plurality of adder registers and the plurality of 
accumulator registers, the accumulator cross point 
switch for producing selected accumulator register 
and adder register values from the first and the 45 
second parallel multiplier outputs and the first and 
the second accumulator outputs; 
first and second parallel MAC adders coupled to the 
accumulator cross point switch, the first and the 
second parallel MAC adders for receiving pairs of 5 q 
selected accumulator register and adder register 
values and adding the pairs of selected accumu- 
lator register and adder register values to produce 
the first and the second accumulator adder outputs; 
and 55 
a MAC output selector coupled to the first and to the 
second parallel MAC adders and to the first and to 
the second parallel multipliers, the MAC output 
selector for receiving the first and the second 
accumulator outputs and the first and the second 60 
parallel multiplier outputs and for producing the 
first dual MAC output data; and 
a first dual MAC second input; 
a second dual MAC coupled in parallel to the first dual 
MAC, the second dual MAC for receiving the input 65 
data at a second dual MAC first input and for producing 
second dual MAC output data at a second dual MAC 
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output, wherein the second dual MAC comprises a 
second dual MAC second input; and 
an adder array coupled to both the first dual MAC and to 
the second dual MAC, the adder array for receiving the 
input data, the first dual MAC output data, and the 
second dual MAC output data and for producing arith- 
metic engine output data at an adder array output, 
wherein the first dual MAC second input is coupled to 
the adder array output, the first dual MAC output is 
coupled to the adder array output, the second dual 
MAC second input is coupled to the adder array output, 
and the second dual MAC output is coupled to the 
adder array output. 

2. An arithmetic engine as claimed in claim 1, wherein the 
second parallel MAC adder produces a condition code input 
to the condition code determiner, and the condition code 
determiner produces a condition code output to the MAC 
output selector. 

3. An arithmetic engine as claimed in claim 1, wherein the 
adder array comprises: 

a first adder pair for receiving the input data, the first dual 
MAC output data, the second dual MAC output data 
and first, second, and third adder pair outputs, and for 
producing first adder pair outputs; 

a second adder pair coupled to the first adder pair, the 
second adder pair for receiving the input data, the first 
dual MAC output data, the second dual MAC output 
data, and the first, the second, and the third adder pair 
outputs and for producing second adder pair outputs; 
and 

a third adder pair coupled to the second adder pair, the 
third adder pair for receiving the input data, the first 
dual MAC output data, the second dual MAC output 
data, and the first, the second, and the third adder pair 
outputs and for producing the arithmetic engine output 
data at the adder array output. 

4. An arithmetic engine as claimed in claim 3, wherein 
each adder of the first adder pair comprises: 

a first adder cross point switch for producing first selected 
values of the input data, the first MAC output data, the 
second MAC output data and the first, the second, and 
the third adder pair outputs; 

first adder registers coupled to the first adder cross point 
switch, the first registers for storing the first selected 
values; 

a first adder register selector coupled to the first adder 
registers, the first adder register selector for receiving 
and choosing first operands and for producing a first 
adder sum output from the first selected values; 

a first adder coupled to the first adder register selector, the 
first adder for receiving and adding the first operands 
and for producing one of the first adder pair outputs; 
and 

a first condition code determiner for receiving a first adder 
condition code input and for providing a first condition 
code to the first adder register selector. 

5. An arithmetic engine as claimed in claim 3, wherein 
each adder of the second adder pair comprises: 

a second adder cross point switch for producing second 
selected values of the input data, the first dual MAC 
output data, the second dual MAC output data, and the 
first, the second, and the third adder pair outputs; 

second adder registers coupled to the second adder cross 
point switch, the second adder registers for storing the 
second selected values; 



09/30/2003, EAST version: 1.04.0000 



5,522,1 

15 

a second adder register selector coupled to the second 
adder registers, the second adder register/selector for 
receiving and choosing second operands from the sec- 
ond selected values; 

a second adder coupled to the second adder register 5 
selector, the second adder for receiving and adding the 
second operands and for producing one of the second 
adder pair outputs; and 

a second condition code determiner for receiving a second 
adder condition code input and for providing a second 10 
condition code to the second adder register selector. 

6. An arithmetic engine as claimed in claim 3, wherein 
each adder of the third adder pair comprises: 

a third adder cross point switch for producing third 
selected values of the input data, the first dual MAC 15 
output data, the second dual MAC output data, and the 
first, the second and the third adder pair outputs; 

third adder registers coupled to the third adder cross point 
switch, the third adder registers for storing the third 
selected values; 20 

a third adder register selector coupled to the third adder 
registers, the third adder register selector for receiving 
and choosing third operands from the third selected 
values; 

25 

a third adder coupled to the third adder register selector, 
the third adder for receiving and adding the third 
operands and for producing the arithmetic engine out- 
put data; and 

a third condition code determiner for receiving a third 30 
condition code input and for providing a third condition 
code to the third adder register selector. 

7. A complex arithmetic processor including an arithmetic 
engine comprising: 

a first dual multiplier accumulator (MAC) for receiving 35 
input data at a first dual MAC first input and for 
producing first dual MAC output data at a first dual 
MAC output, wherein the first dual MAC comprises: 
a dual multiplier for receiving a plurality of sources and 
for producing first and second parallel multiplier 40 
outputs, the dual multiplier including a multiplier 
cross point switch for receiving the plurality of 
sources and for producing selected sources at a 
plurality of multiplier cross point switch outputs, 
wherein the dual multiplier comprises: 45 
a plurality of multiplier registers coupled to the 
multiplier cross point switch outputs, the plurality 
of multiplier registers for storing the selected 
sources as multiplier register values; 
a register selector coupled to the plurality of multi- 50 
plier registers, the register selector for choosing 
selected multiplier operands from the multiplier 
register values; and 
first and second parallel multipliers coupled to the 
register selector, the first and the second parallel 55 
multipliers for receiving pairs of selected multi- 
plier operands and multiplying the pairs of 
selected multiplier operands to produce the first 
and the second parallel multiplier outputs, respec- 
tively; 60 
an accumulator coupled to the dual multiplier, the 
accumulator for receiving the first and second par- 
allel multiplier outputs, selecting particular first and 
second parallel multiplier outputs in a condition code 
determiner, and for producing the first dual MAC 65 
output data therefrom, wherein the accumulator 
comprises: 
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a plurality of adder registers for storing the first and 
the second parallel multiplier outputs and a plu- 
rality of accumulator registers for storing a first 
and a second accumulator output; 

an accumulator cross point switch coupled to the 
plurality of adder registers and the plurality of 
accumulator registers, the accumulator cross point 
switch for producing selected accumulator register 
and adder register values from the first and the 
second parallel multiplier outputs and the first and 
the second accumulator outputs; 

first and second parallel MAC adders coupled to the 
accumulator cross point switch, the first and the 
second parallel MAC adders for receiving pairs of 
selected accumulator register and adder register 
values and adding the pairs of selected accumu- 
lator register and adder register values to produce 
the first and the second accumulator adder outputs; 
and 

a MAC output selector coupled to the first and to the 
second parallel MAC adders and to the first and to 
the second parallel multipliers, the MAC output 
selector for receiving the first and the second 
accumulator outputs and the first and the second 
parallel multiplier outputs and for producing the 
first dual MAC output data; 
a first dual MAC second input; 
a second dual MAC coupled in parallel to the first dual 
MAC, the second dual MAC for receiving the input 
data at a second dual MAC first input and for producing 
second dual MAC output data at a second dual MAC 
output, wherein the second dual MAC comprises a 
second dual MAC second input; and 
an adder array coupled to both the first dual MAC and to 
the second dual MAC, the adder array for receiving the 
input data, the first dual MAC output data, and the 
second dual MAC output data and for producing arith- 
metic engine output data at an adder array output, 
wherein the first dual MAC second input is coupled to 
the adder array output, the first dual MAC output is 
coupled to the adder array output, the second dual 
MAC second input is coupled to the adder array output, 
and the second dual MAC output is coupled to the 
adder array output. 

8. A complex arithmetic processor as claimed in claim 7, 
wherein the second parallel MAC adder produces a condi- 
tion code input to the condition code determiner, and the 
condition code determiner produces a condition code output 
to the MAC output selector. 

9. A complex arithmetic processor as claimed in claim 7, 
wherein the adder array comprises: 

a first adder pair for receiving the input data, the first dual 
MAC output data, the second dual MAC output data 
and first, second, and third adder pair outputs, and for 
producing first adder pair outputs; 

a second adder pair coupled to the first adder pair, the 
second adder pair for receiving the input data, the first 
dual MAC output data, the second dual MAC output 
data, and the first, the second, and the third adder pan- 
outputs and for producing second adder pair outputs; 
and 

a third adder pair coupled to the second adder pair, the 
third adder pair for receiving the input data, the first 
dual MAC output data, the second dual MAC output 
data, and the first, the second, and the third adder pah- 
outputs and for producing the arithmetic engine output 
data at the adder array output. 
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10. A complex arithmetic processor as claimed in claim 9, 
wherein each adder of the first adder pair comprises: 

a first adder cross point switch for producing first selected 
values of the input data, the first MAC output data, the 
second MAC output data and the first, the second, and 5 
the third adder pair outputs; 

first adder registers coupled to the first adder cross point 
switch, the first registers for storing the first selected 
values; 

a first adder register selector coupled to the first adder 
registers, the first adder register selector for receiving 
and choosing first operands and for producing a first 
adder sum output from the first selected values; 

a first adder coupled to the first adder register selector, the 15 
first adder for receiving and adding the first operands 
and for producing one of the first adder pair outputs; 
and 

a first condition code determiner for receiving a first adder 
condition code input and for providing a first condition 20 
code to the first adder register selector. 

11. A complex arithmetic processor as claimed in claim 9, 
wherein each adder of the second adder pair comprises: 

a second adder cross point switch for producing second 
selected values of the input data, the first dual MAC 2 5 
output data, the second dual MAC output data, and the 
first, the second, and the third adder pair outputs; 

second adder registers coupled to the second adder cross 
point switch, the second adder registers for storing the 
second selected values; 

a second adder register selector coupled to the second 
adder registers, the second adder register selector for 
receiving and choosing second operands from the sec- 
ond selected values; 
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a second adder coupled to the second adder register 
selector, the second adder for receiving and adding the 
second operands and for producing one of the second 
adder pair outputs; and 

a second condition code determiner for receiving a second 
adder condition code input and for providing a second 
condition code to the second adder register selector. 

12. A complex arithmetic processor as claimed in claim 9, 
wherein each adder of the third adder pair comprises: 

a third adder cross point switch for producing third 
selected values of the input data, the first dual MAC 
output data, the second dual MAC output data, and the 
first, the second and the third adder pair outputs; 

third adder registers coupled to the third adder cross point 
switch, the third adder registers for storing the third 
selected values; 

a third adder register selector coupled to the third adder 
registers, the third adder register selector for receiving 
and choosing third operands from the third selected 
values; 

a third adder coupled to the third adder register selector, 
the third adder for receiving and adding the third 
operands and for producing the arithmetic engine out- 
put data; and 

a third condition code determiner for receiving a third 
condition code input and for providing a third condition 
code to the third adder register selector. 
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[57] ABSTRACT 

Circuit arrangement for calculating matrix operations, 
such as those which recur frequently in signal process- 
ing, specifically in conjunction with neural networks, 
having a systolic array of multipliers and adders, down- 
stream from which a recursive accumulator is con- 
nected. In addition to products, sums and differences of 
matrices, this circuit arrangement also allows squares, 
absolute magnitudes of sums and differences and 
squares of sums and differences of two matrices to be 
calculated very efficiently. Furthermore, with the aid of 
the recursive accumulator, it is possible to transpose 
matrices, to calculate row sums and column sums, and 
to search for minimum or maximum matrix elements. 
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the matrices which are to be processed by the circuit 
CIRCUIT ARRANGEMENT FOR CALCULATING arrangement are partitioned into k x k submatrices; 
MATRIX OPERATIONS IN SIGNAL PROCESSING the overall circuit has first and second inputs, one 

output and one input/output; 
BACKGROUND OF THE INVENTION 5 the matrix multiplier has the following units: 
The invention relates to a circuit arrangement for dual-port first and second memory units for the stor- 
calculating matrix operations which recur frequently, age and of input submatrices, the first 

such as those in signal processing, especially in the mem °T y Umt connected to the first input and the 
context of neural networks. Since the computation op- lft ^^f^ T l ^ t0 ^ * e ^put/out- 

erations which are required for calculating neural n<£ 10 pUt ' *"? ^ fa&t memory mt havn « k dependent 
i , , , memories t 

works can be reduced to a comprehensible number of A e ,^*^iL * . • , ■ * . - 

« . ^ y . . ; z , a systolic arrangement, which is connected to the first 

dementery mamx operations x is sensible from the md ^ men f ^ ^ . wWdl « S 
point of view of the . necessary mgh computation speed ^ by control signals, which arrangement has k 
m the execution of these operations to implement such 15 mu i ti pHers and k dowSm^onnected fdders for the 
computation operations in hardware, rather than to multipHcation and addition/subtraction of input subma- 
carry them out with the aid of software. trices, the first memory unit storing the rows of the 

A pnor art which is closest to the invention is repro- submatrix in k separate sub-memory units, and supply- 
duced m the publication by U. Ramacher, "Design of a ing this data via k separate supply lines to the multipli- 
first Generation Neurocomputer", VLSI Design of 20 ers, so that each multiplier, with the downstream-con- 
Neural Networks, edited by U. Ramacher, U. Ruckert, nected adder successively, and in a synchronized man- 
Kluwer Academic Publishers, Nov. 1990. This publica- ner, carries out a multiplication of a matrix element 
tion describes a circuit arrangement which is con- from the corresponding row of the first memory unit 
structed from a systolic arrangement of multipliers and with the matrix element of a row of the submatrix from 
adders. This systolic arrangement makes it possible to 25 the second memory element, and the results are accu- 
calculate matrix products, the matrices which are to be mulated, via the adder chain which has the k adders of 
multiplied being split into blocks of size 4X4, and it the matrix multiplier, to form product sums; 
being possible to multiply submatrices of this size with a first final adder of the carry-select type is in this 
the aid of the systolic arrangement in each case. The case provided at the end of this adder chain, and the 
computation operations which can be carried out using 30 recursive accumulator has the following units: 
this circuit arrangement are suitable for the calculation third and fourth memory units, both of which are 
of specific neural network types, for example multilayer connected to the second input, the third memory unit 
perceptron networks with feedback. storing the weightings and the fourth memory unit 

The disadvantages of this circuit arrangement, as it is storing the coefficients; 
described in the publication by U. Ramacher 1990, are 35 a Anther multiplier for the multiplication of the prod- 
primarily that: uct m the 6031 a <*der by the coefficients or for 

the transposition, addition and subtraction are not the product sums, which further multiplier is 

supported by matrices, connected at a second input either to the output of the 

result matrices cannot be squared or cannot be multi- fmal adder or t0 Ae out P ut of ^ fourth memory 

plied by a scalar, and that 40 umt; . 

the calculation of row and column sums and the * fu f th ?" a ,? der ' ^ h » connected downstream 
search for extreme matrix elements are not supported fr °™ ? rth f ^P^j u f <* *e component-based 
by this circuit arrangement £^ P SUmS ^ the westings from the 

Furthermore, in the case of this circuit arrangement, , c memory; 
there is no monitoring of the value range of the matrix 45 J^StS^ whl j* lS ^ do ^™ 
coefficients, and the vtlue ranges of the matm elements tlQ ™ ^ further md » <* *e carry-select type, 
are not limited if an overflow occurs. & recureive loQp via ^ fimher addef> ^ md 

SUMMARY OF THE INVENTION final adder or a comparator, a shifter and an overflow- 

m . l. u *u * . , , . 50 monitoring circuit, as well as via the third memory unit 

The object on which ^ for comp S nent . ba sed MIN/MAX comparison of the 

%L!?r\^ em ?* y ^ ° fwlUCh ^ ^ result submatrices with a submatrix which is present at 
scnbeddisady^Ug^ the third input, and for overflow regulation of the 

which supports the calculation of matrix products and wei ghtings which are stored in the third memory unit 
the multiphcation of matrix products by scalars, as well 55 A systolic a^gement of multipliers and adders is 
as the squaring of matrix products, sums and different likewise provided in the circuit arrangement according 
formation of matrices, the multiplication of matrix sums t0 ^ invention. In contrast to the circuit arrangement 
and matrix differences by scalars, forming the magni- kn 0WQ f rom the prior art, in the circuit arrangement 
tude of matrix sums and matrix differences, squaring according to the invention, a recursive accumulator is 
matrix sums and matrix differences, transposition of 60 connected downstream from this systolic arrangement 
matrices, and matrix products, the calculation of row 0 f multipliers and adders. It is possible to carry out a 
and column sums of matrices, and the search for ex- considerably more comprehensive class of computation 
treme, that is to say minim u m and m a xim u m , matrix operations with the aid of this recursive accumulator. In 
elements. This object is achieved by means of a circuit particular, using the circuit arrangement according to 
arrangement for calculating matrix operations, having a 65 the invention, it is possible to multiply matrix products 
matrix multiplier and a recursive accumulator, which is by scalars, to square matrices or matrix products, to 
connected downstream from this matrix multiplier, and form sums or differences of matrices and to multiply by 
having the following features: scalars, to calculate the absolute magnitude of matrix 
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sums and matrix differences and their squares, and to bidirectional interface which is designated by inout2. 
transpose matrices. Furthermore, the circuit arrange- The bidirectional interface can be monitored with the 
ment according to the invention allows the calculation aid of the signal Port2c (see FIG. 2). The coefficients of 
of row and column sums of matrices and matrix prod: two matrices which are to be multiplied with one an- 
ucts and sums or differences of matrices. Finally, this 5 other are supplied via the interfaces inl and inout2 
circuit arrangement makes it possible to search for max- respectively and the memories Bl and B2 respectively, 
imum and minimum matrix elements in previously cal- connected thereto, to a systolic arrangement compris- 
culated matrices. The present invention is also a circuit ing four multipliers and four adders. The input data of 
arrangement for calculating transpositions, row sums the multipliers of the systolic arrangement are in this 
and column sums of matrices and for searching for 10 case selected with the aid of the switch SI, which is 
extreme matrix elements, which has a multiplier and an controlled via the signal SEN1. The adder chain of the 
adder which is connected downstream from this multi- systolic arrangement is connected to the output of an 
plier and links the output of the multiplier to an output EXOR gate whose first input is connected to the mem- 
of this circuit arrangement, in a recursive manner. ory B2 via the switch S2, which is controlled with the 

The present invention is further a circuit arrangement 15 aid of the signal SEN2. The other inputs of the multipli- 
for calculating matrix operations, which has a matrix ers are connected to the memory Bl. This memory is a 
multiplier and a recursive accumulator, which is con- combination of four independent memories which in 
nected downstream from the matrix multiplier, for cal- each case comprise two-times four words. Latches are 
culating transpositions, row sums and column sums and provided in all cases for buffer storage of data and cbm- 
for searching for extreme matrix elements. The accumu- 20 putation results in the systolic arrangement and in the 
lator has a multiplier and an adder, which is connected overall circuit. The latches do not carry out any compu- 
downstream from this multiplier and links the output of tational functions but ensure merely the synchronous 
the multiplier to an output of this circuit arrangement, sequencing of the overall circuit, and are required in 
in a recursive manner. order to implement algorithmically dependent time 

Advantageous developments of the present invention 25 delays. Located at the end of the adder chain of the 
are as follows. The matrix multiplier has two dual-port systolic arrangement is a final adder Finl, which corn- 
memory units and a systolic chain of multipliers and bines the carry digits of the carry-save adders, which 
adders, this chain being connected downstream from are designated by Add, into the carry-select format. A 
these memory units. preferred embodiment of the multipliers within the 

The adders of the systolic chain are of the carry-save 30 systolic arrangement is described in the work by U. 
type. Ramacher, 1990. 

The matrix multiplier has a systolic chain of multipli- A recursive accumulator, which is shown in FIG. 3, 
ers and adders, at whose end a first final adder of the is connected in the overall circuit to the matrix multi- 
carry-select type is provided. plier which is shown in FIG. 1 and is described in more 

The recursive accumulator has a second final adder 35 detail in FIG. 2 (see FIG. 1). A multiplier AM, down- 
which is connected downstream from the adder and is stream from which an adder AA is connected, Is located 
of the carry-select type. at the input of this recursive accumulator. One input of 

The recursive accumulator has a comparator which this multiplier is connected to the output of the final 

compares the result of the adder with zero. adder of the matrix multiplier. The second input of the 

ttprpp npcrPTPTTHM ttto nPAWTvrc 40 multiplier can be connected to the output of the mem- 
BRIEF DESCRIPTION OF THE DRAWINGS Qry B4 Qr tQ ^ Qf ^ ^ ^ Qf ^ ^ 

The features of the present invention which are be- the constant value 1 depending on the position of the 
lieved to be novel, are set forth with particularity in the switch S3, which is controlled via the signal SENS. The 
appended claims. The invention, together with further second input of the adder AA which is connected 
objects and advantages, may best be understood by 45 downstream from this multiplier is connected to the 
reference to the following description taken in conjunc- output of a second EXOR gate. One input of this second 
tion with the accompanying drawings, in the several EXOR gate is connected to the signal LOOP via the 
Figures of which like reference numerals identify like switch S4 which is controlled via the signal SEN4. This 
elements, and in which: signal LOOP is identical to the signal at the input in3 or 

FIG. 1 shows a signal-flow diagram of a circuit ar- 50 to the output of the memory B3, depending on the posi- 
rangement for calculating matrix operations. tion of the switch S6, which is controlled via the signal 

FIG. 2 shows a preferred implementation of a sub-cir- SEN6. The output of the memory B3 is also designated 
cuit of the circuit from FIG. 1, which is designated, in as LOC in FIG. 3. The output of the adder AA is con- 
FIG. 1, as a matrix multiplier. nected to a second final adder Fin2 via the switch S51 

FIG. 3 shows a preferred implementation of a sub-cir- 55 which is controlled via the signal SEN 51. Connected in 
cuit of the circuit arrangement from FIG. 1, which is parallel with this final adder Fin2 is a comparator Comp 
designated, in FIG. 1, as a recursive accumulator. which compares the output of the adder AAwith 0 and 

DESCRIPTION OF THE PREFERRED Jj5 nCe 23 . a mathematical-sign decision element 

EMBODIMENT mathematical-sign signal, together with the con- 

60 trol signal SEN52, determines the position of the switch 

The FIGS. 4, 5, 6 and 7 show signal-flow diagrams S52 which, depending on the switch position, connects 
for driving memory units Bl, B2, B3 and B4. the output signal, which is designated by CHAIN, of 

As FIG. 1 shows, the circuit arrangement according the final adder Fin2 or the loop signal, which is desig- 
to the invention comprises two sub-circuits which are nated by B3 and is delayed via three latches, to a shifter 
designated as a matrix multiplier MM and as a recursive 65 which is controlled by the signal ShiftEn. The output of 
accumulator KA respectively. The overall circuit has the shifter is connected to an overflow control circuit 
two inputs, which are designated by inl and in3 respec- Overflow, whose output is connected to the memory 
tively, an output which is designated by out3, and a unit B3. 
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The signal path via the adder AA, the final adder 3.1 Multiplication of the result submatrix by a coeffc 
Fin2 or the comparator Comp, the shifter and the over- cient (a dedicated coefficient for each submatrix col- 
flow control circuit, as well as via the memory unit B3 umn). 

and the EXOR gate whose output forms the second 3.2 No multiplication (multiplication by 1). 

input of the adder AA, represents a recursive loop by 5 3.3 Multiplication of the individual submatrix compo- 

means of which the computation operations which are nents by themselves (squaring). 

novel in comparison with the prior art can be carried 4. Addition/accumulation of result submatrices from 

out. 3.1 to 3.3. 

In contrast to the circuit arrangement which is de- 4.1 Addition of the result submatrix to the submatrix 

scribed in the work by U. Ramacher, 1990, the circuit 10 which is stored in the memory unit B3 (local accumula- 

arrangement according to the invention now also makes tion). 

possible matrix addition and component-based min/max 4.2 Addition of the result submatrix to the submatrix 

comparison, in addition to matrix-matrix multiplication. which is applied to the input in3 at the same time (dis- 

The input matrices, split into 4x4 submatrices, are tributed accumulation). 

locally transposed and are then added or multiplied. 15 5. Component-based min/max comparison for the 

The components of the result matrix can optionally be result submatrix from 3. 1 to 3.3. 

squared or multiplied by a scalar. Either a global accu- 5.1 Min/max comparison between the results from 

mulation or a min/max comparison in the row or col- 3.1 to 3.3 and the submatrix stored in the memory unit 

umn direction can be applied to the 4x4 submatrices B3. 

produced in this way. 20 5.2 Min/max comparison between the submatrix from 

The value range of the 16-bit weighting values are 3.1 to 3.3 and a submatrix which is applied to the input 

monitored for an overflow and, if an overflow occurs, in3 at the same time. 

the value is automatically limited. Using a special shift 6. Overflow regulation of the weightings which are 

device, the value range of the weightings can be objec- „ stored in the memory unit B3. 

tively changed, for example halved or doubled in steps, 6.1 Overflow recognition and value saturation of the 

in order thus to avoid an overflow and to utilize the weightings which are to be limited to 16 bits. 

16-bit word width of the weighting memory more effec- 6.2 Controlled, global reformatting of all weighting 

tively. submatrices. 

The development of the circuit arrangement accord- 3Q The computation operations 1.1 to 6.2 which are 

ing to the invention for calculating matrix operations is implemented in the circuit arrangement are carried out 

based on the idea of distributing the overall computa- in different parts of the overall circuit and can be con- 

tion work between a predetermined number of elemen- nected to one another in. specific combinations in order 

tary circuit arrangements, and of partitioning the large to support different neural algorithms, 

matrices which are to be multiplied or added into 4x4 33 The following can be combined: 

submatrices. Each elementary circuit arrangement thus in each case one operation 1.1 to 2.2 with 

processes only a specific row and column region (sub- in each case one operation 3.1 to 3.3 and 

matrices) of the matrix which is held in local memories. in each case one operation 4.1 to 5.2. 

The calculation of large matrices is combined from the The two operations 6.1 and 6.2 can be combined with 

processing of the 4x4 submatrices. During this combi- 4Q 4.1. 

nation, the indices of the submatrices are expanded to The execution of the individual computation opera- 

the overall effective ranges of the large matrices. Oper- tions with the aid of the circuit arrangement, and the 

ations which achieve the solution for large matrices control of the circuit arrangement are described in de- 

from the calculation of submatrices are local and global tail in the following text. 

accumulation and local or global min/max comparison 45 Computation operation 1.1: submatrix in the memory 

of the components. Bl multiplied by the submatrix in the memory B2. 

The circuit arrangement according to the invention The two matrices A and B are loaded into the memo- 
carries out the following simple computation operations ries Bl and B2. Switch SI is in the position B2, switch 
which, linked in different ways, implement the neural S2 is in the position 0 and the signal S/Al has the value 
algorithms. In this case, the term submatrix is used to 50 0. The memory B2 supplies the data B (1,1), B (2,1), 
designate 4X4 matrix blocks which are produced by B (4,4), that is to say the matrix elements of the left, top 
partitioning large matrices. 4x4 submatrix of the matrix B, in a distributed manner 

1. Multiplication of input submatrices with each clock cycle over 16 clock cycles. The control 

1.1 Submatrix A (in the memory Bl) multiplied by signals CEN1, CEN4 at the input registers of the 
submatrix B in the memory (B2); 55 multipliers have the values 0 and are thus not active 

1 .2 Submatrix A (in the memory Bl) multiplied by the except at the following times: in clock cycle 1, CEN1 is 
unit matrix; active and causes the loading of the input register of the 

2. Addition/subtraction of input submatrices uppermost multiplier with the matrix element B (1,1). In 

2.1 Submatrix B (in the memory B2) is added to the the next clock cycle, CEN2 is active and loads the input 
submatrix A (in the memory Bl), previously multiplied 60 of the second multiplier with the matrix element B (2,1). 
by the unit matrix; In the third and fourth clock cycles, the input registers 

2.2 Submatrix B (in the memory B2) is subtracted of the other two multipliers are loaded with CEN3 and 
from submatrix A (in the memory Bl), previously multi- CEN4. In the fifth clock cycle, CEN1 is active again 
plied by the unit matrix; and loads the matrix element B ( 1,2 ) in the first multi- 

2.3 Submatrix B (in the memory B2) pass [sic] 65 plier. This distribution process continues cyclically and 
through the adder chain, no addition; ends after the 1 6th clock cycle when the matrix element 

3. Multiplication of the result submatrices from 1.1 to B (4,4) is loaded into the fourth multiplier by means of 
2.2. CEN4-active. 
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In parallel with this process, the memory unit Bl signal S/Al has the value zero. All the other steps cor- 

holds the submatrix A ( 1,1 ), A ( 4,4 ) in a column- respond to those in the computation operation 1.2. 

based manner in four separate sub-memory units and Computation operation 2.2: Submatrix in the memory 

supplies these data via four separate supply lines to the B2 subtracted from the submatrix in the memory Bl. 

multipliers of the systolic arrangement (see FIG. 2). 5 in this case, the sequence is completely analogous to 

The first multiplier receives its first matrix element A the sequence during the computation operation 2.1. 

(1,1) from the memory Bl at the same time as the matrix Only a change in the mathematical sign is required, 

element B (1,1). In the three subsequent clock cycles, which & out by ^ signal S/A1 bem ^ yen ^ 

^e memory Bl supplies the matrix elements A (2,1), A va l ue i. in consequence, the EXOR gate together with 

*)™te™}me.*^ibc^rt 10 ^ adder connected downstream from it forms the 

*St Hf , C £ leS ' TJ? el T nt + B i U) * two's^omplement of the data which are present in the 

also loaded at the multiplier, the application to the ma- memory B2 

trix elements A (1,1), A (4,1) is repeated. After 16 Computation operation 2.3: Loading the data which 

clock cycles, that is to say once the sequence of the m ^ fa ^ B2 ^ J recurs ^ accu . 

matrix elements A (1,1), . .., A (4,1) has been applied to 15 «aam~J J? 

I** i- c A. x I- . r mulator without addition, that is to say circumventing 

the multipliers of the systolic arrangement four times, ^ ^ fldte chain y s 

the reading for the first multiplier of the chain is com- « . . . * . . . , 

pleted. The second multipUer receives the matrix ele- ™ e com P utetlon . " » earned out m an 

ments A (1,2) A (4,2) from the memory Bl. The third ™*°&>™ n f mer to *? computation operation 2.1, a 

and the fourth multipliers respectively receive the ma- 20 zer f supplied, however, instead of a unit 

trix elements A (1,3), . .., A (4,3) and A (1,4), A (4,4) ^ 18 don f b J ^ switch SI bemg in the posi- 

respectively from the memory Bl with in each case one *5 zero all the clock eye es 

further clock cycle delay. Overall, a delay of three Computation operation 3.1: Multiplication of a result 

clock cycles results between the first and the fourth submatrix by a coefficient , 

multipliers of the chain. In consequence, the last multi- 25 In case ofihss computation operation, the systolic 
plier receives the value of the matrix element A (4,4) for adder cham applies the sums, which are mentioned in 
the fourth arid last time in the 19th clock cycle. ^ desc nption of toe computation operation 1.1, of the 
Each multiplier can start one multiplication of two products from the matrix elements of the lines of matrix 
matrix elements of the matrices A or B respectively per A bv ^e columns of matrix B, to be precise in the se- 
clock cycle. Such a multiplication lasts for seven clock 30 <l uen ce such that the sum of the products from the ma- 
cycles. The accumulation of the results is accomplished *™ dements of the first line of matrix A and the matrix 
via the adder chain of the matrix multiplier in FIG. 2. In elements of the first column of the matrix B is calculated 
clock cycle 8, the uppermost, first adder receives the first » and ^e sum of the products from the matrix ele- 
product A (1,1) . B (1,1) of the first multiplier, adds the ments of the fourth row of the matrix A and the fourth 
zero from switch S2 thereto and passes the result to the 35 column of the matrix B is calculated last, starting from 
subsequent second adder, in the following clock cycle the 27 th clock cycle. The upper 19 bits of the data are 
(clock cycle 9). At the same time as it receives the sum present in the carry-save form and must be converted 
of the first adder from the second multiplier, said sec- into the binary form (final adder Finl) before they can 
ond adder receives the product B (1,2) . A (2,1) which be supplied as input data to the first multiplier AM of 
it adds to the sum (clock cycle 10). The products B (1,3) 40 the recursive accumulator. At the same time, a delay of 
. A (3,1) and B (1,4) . A (4,1) are also added in the one clock cycle is produced by the final adder Finl. 
following two clock cycles. In clock cycle 12, the sum Two further clock cycles are produced to provide the 
of the four products of the matrix elements of the first data at both inputs of the multiplier AM. 
row of the submatrix A and the first column of the The sums of the products of matrix elements coming 
submatrix B are present at the output of the fourth and 45 from the final adder Finl arrive at the multiplier AM in 
final adder. The sums which are still missing are added the sequence described above, and are multiplied by the 
in the further clock cycles. The overall result matrix is coefficients K(l), K(2), K(3) and K(4), which are stored 
completely calculated when the sum of the products of in the memory B4. In this case, the following products 
the matrix elements of the fourth row of the matrix A are formed in series: 
and the matrix elements of the fourth column of the 50 

matrix B appears at the output of the adder chain, in the X(i) . p<i,i), K(i) . P(2,i), K(i) . P(3,i), K(i) . P(4,i), 

28th Clock Cycle. K(2) . Pd,2), K(2) - P(W), ... K(4) . P(4,4). 

Computation operation 1.2: submatrix in the memory 

Bl multiplied by the unit matrix (no multiplication). 1x1 this case, the product P(iJT) is equal to the sum of 

The computation operation 1.2 runs in an analogous 55 ^ products of the matrix elements in the i-th row of 

manner to the computation operation 1.1, the second the rnatrix A and the matrix elements in the k-th column 

matrix B being replaced by the unit matrix. The switch of ^ matrix B. The multiplier requires seven clock 

SI is switched cyclically between the value 1 and the cycles, a new multiplication being started in each clock 

value 0 for this purpose. SI is at 1 in clock cycles 1, 6, cycle. The results appear at the output of the multiplier 

11 and 16, and is at 0 in all the other clock cycles. ' 60 AM from the 21st to the 37th clock cycles. In this case, 

Computation operation 2.1: submatrix in the memory the switch S3 is always in position B4. Computation 

B2 added to the submatrix in the memory Bl. operation 3.2: No multiplication. 

In order to carry out the computation operation 2.1, The switch S3 is in the position 1, as a result of which 

the computation operation 1.2, which causes a multipli- the value 1 is allocated to all the coefficients K(i) for i 

cation of the matrix A which is present in the memory 65 =1, 4. The remaining sequence is identical to the 

Bl, by the unit matrix, is expanded by the addition of a computation operation 3.1. 

matrix B which is present in the memory B2. In this Computation operation 3.3: Squaring of the matrix 

case, the switch S2 is in the position B2. The control components. 
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The switch S3 is in the position L2. The values P(ij) Computation operation 5.1: Component-based min/- 
coming from the adder chain are applied to both inputs max comparison of the result submatrix from 3. 1 to 3.3 
of the multiplier AM. The multiplicand word and the with the submatrix which is stored in the part of the 
multiplier word are thus identical. If all the other steps memory B3 which belongs to the recursive loop, 
are carried out in an analogous manner to computation 5 The switch S4 is in the position LOOP, and the 
operation 3.1, then the product sums P(y) appear in the switch S6 is in the position LOC. The signal S/A2 has 
position of the coefficients K(i) and the product matrix the value 1 and the EXOR gate and the downstream- 
is multiplied by itself. In accordance with the computa- connected adder stage AA are used to form the two's 
tion operation 1.2, the product matrix can, preferably complement of the data passing via the switch S4. The 
also be one of the original matrices. 10 switch S51 is in the position MULT5 and the position of 

Computation operation 4.1.1: Local accumulation of the switch S52 is produced from the result of the com- 

the matrix from 3.1 to 3.3 parator Comp, which tests whether the result of the 

The product sums P(l,l), P(2,l), P(4,4) coming adder ^ greater than or equal to zero. The adder 

from the adder chain of the matrix multiplier are added determines the difference between the value originating 

in a component-based manner to the data S(l,l), S(2,l), 15 from tne multiplier AM and the value of the memory 

S(4,4) which are stored in the memory B3. The new ^3. If ^ difference is positive, the comparator posi- 

values of S(U) which thus result are subsequently tions ^ switch S52 into the position MULT5 during 

stored in the memory B3 again (fetch and add). MAX comparison, and into the position ADD during 

In this case, the switch S4 is in the position LOOP, MIN comparison. If, in contrast, the calculated differ- 

the switch S51 is in the position ADD, the switch S52 is 20 ence * ^g^ve, then the value from the memory B3 is 

in the position CHAIN, and the switch S6 is in the S^ter than the product at the output of the multiplier 

position LOC. The control signal S/A2 has the value AM ' md ^ switch P° sition s are interchanged. The 

zero here, the control signals ShiftEn and OflowEn (see result which u selected via ^ svAtch S52 ™ st °red in 

FIG. 3) both being set to zero initially, here. The data the memory B3 without any further change. 

S(l,l), S(2,l), S(4,4) are read in this sequence out of 25 ^S?^^ 011 °P? ratlon , J* 2: Component-based 

the part of the memory B3 which is connected to the J^ff* ^* panS ? n of su ^tnx from 

recursive loop. Said data are passed via the switches S6 ?'* to . 3 I 3 ™f ^matrix which is applied to the 

and S4 and the EXOR gate to the adder AA, which is ^ m3 at the same time. 

constructed from a series of full-adders. In clock cycle m J^SSfJ 0 ^ £ mpUt ? on H' ^ 

27 sri t\ k sHHpH to pm i\ an rf <zn 1\ « »aaJa t« 30 swltch S6 B here m ^ pontoon GLOBIN and the 

4" ^ } thc Sf cS ™ tch 87 is f 7-5" GLOBOUT - M fte ?r 

The carry word aid sum word are linked in the *f? f*° SOaS t0 * e 5L 

downstrearn-connected final adder Fin2 to form a single ?°TT 
K • w „ rt1 „ & cn u * • j r i j j T7- * value saturation for the 16-bits of data stored m the 

buiary value The 50-bi -wide final adder Fin2 is a com- 35 m B3 (weightings). 

plex circuit having a long delay time, as a result of ^ ^ j^^ft^ wdghtings m neuxal net - 
wkch if possible, it ^ miplemented at oiuy one pomt works h ^ ^ to 16 & 0 g verflow recogni . 

fil^H^^t^S 86 " ? 6 ^ ^ m tionandsaturationareused to suppress the liinits whkh 

^SL? « u *k cl ^ c ^Pf crA ^ can be represented being exceeded or undershot during 

The results, the new values S(l 1), S(4,4 , are 40 ^ learnin ^ ^ ^ comin from ^ swit( * 

passed via the .switch S52 to the shifter. If the value of S52 m ^ted in the overflow unit for overflow and, 

the signal ShiftEn is zero, then the data remain un- p^y, m satiated, if OflowEn=l is set by the 

changed. If OflowEn =0, the downstream-connected controUer (16 bit wordlength). If OflowEn =0, no satu- 

overflow contoUer also has no effect on the data. Six ratioil k carrie d out and the data path has a 50-bit 

clock cycles after the old values of S(iJ) have been read 45 length. An overflow counter, which can be read by the 

out of the memory B3, the new values of S(i,J) are controUer, is incremented whenever an overflow is 

WI l ttenm * . . recognized (OflowStat =1). 

Computation operation 4.1.2: Local accumulation: Computation operation 6.2: Reformatting the 

Resetting of the recursive loop. weighting matrices. 

A zero is added to the product sums P(i,J) in each 50 Before data are written into the part of the memory 

case, instead of the old values S(U). In this case, the B3 which belongs to the recursive loop, they can be 

switch S4 is at zero. The remaining sequence is the same shifted by one position to the right or left. The numeri- 

as in the computation operation 4.1.1. cal range which can be represented can thus be matched 

Computation operation 4.2: Distributed accumulation dynamically during the learning process. The drive via 

of the matrix from 3.1 to 3.3. 55 the signal ShiftEn takes place from the controUer. 

The switch S4 is in the position LOOP, the switch The matrix operations which can be carried out with 
S51 is in the position ADD, the switch S52 is in the the aid of the circuit arrangement have thus been de- 
position CHAIN, and the switch S6 is in the position scribed. The foUowing text is intended to describe the 
GLOBIN. In contrast to the operation 4.1.1, the old operations in the memory units Bl, B4 comprehen- 
values of S(l,l), S(4,4) are not read out of the mem- 60 sively. These further operations are carried out in the 
ory B3 but are taken from the input in3. The accumula- memory units Bl, B4 in order to permit simple and 
tion passes via the switches S6 and S4 and the adders effective data transportation between the circuit ar- 
AA and Fin2 as well as the switches S51 and S52. After rangement and its environment (for example, simulta- 
the switch S52, the results, the new values S(l,l), neous loading, computation and unloading). Further- 
S(4,4), are not written via the shifter into the memory 65 more, the matrix components are sorted and, if desired, 
B3 but are passed directly to the output out3 via the transposed, for processing in the circuit arrangement, 
switch S7. A delay time of five clock cycles is required The individual memory units are selected in the manner 
for this purpose. shown in the signal-flow diagrams in FIGS. 4 to 7. 
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7. Operations in the matrix memory Bl 

7.1.1 Load, not transposed 

7.1.2 Load transposed 

7.2.1 Change in the memory halves 

7.2.2 No change in the memory halves 5 
7.3 Read the data 

8. Operations in the matrix memory B2 
8.1 Load the data 

8.2. 1 Change in the memory halves 

8.2.2 No change in the memory halves 10 

8.3. 1 Read, not transposed 

8.3.2 Read, transposed 

9. Operations in the matrix memory B3 

9.1 Fetch and add, in the half of the memory B3 
belonging to the loop IS 

9.2.1 Change in the memory halves 

9.2.2 No change in the memory halves 

9.3.1 Load/read in the memory half which belongs to 
the interface (Port), not transposed 

9.3.2 Load/read in the memory half which belongs to 20 
the interface, transposed 

10. Operations in the coefficient buffer B4 

10.1 Load the data 

10.2 Read the data 

10.3 Change in the memory halves 25 

11. Control of the bidirectional interface inout2. 

11.1 Interface switched to input 

11.2 Interface switched to output 

In conjunction with most computation operations, 
the interface inout2 is used as an input (1 1.1), and only 30 
for short, precisely specified steps is it used as an output 
(11.2). 

The operations in the memory units are described 
comprehensively in the following text. The terms port- 
side and loop-side are used for brevity in this descrip- 35 
tion. Port-side is intended to mean a memory half which 
belongs to the interface, while loop-side means a mem- 
ory half facing the recursive loop. 

The data A(i,J) are supplied from the exterior via the 
interface inl and pass into the memory Bl. Their se- 40 
quence is A(l,l), A(l,2), A(4,4). The memory Bl 
holds these 16 matrix elements such that the matrix 
elements A(l,l), A(4,l), that is to say the first column 
of the 4X4 submatrix of A, are in the top part of the 
memory and can subsequently be allocated to the multi- 45 
plier 1. The matrix elements A(l,2), ... A(4 F 2), that is to 
say the elements of the second column, are stored in the 
second memory row for the multiplier 2, etc. The word- 
line signals are passed systolically from memory word 
to memory word, using a built-in sequencer, in order to 50 
select the memory cells, so that only one memory word 
is written per clock cycle. The sequencer is triggered by 
a trigger pulse B1PORN from the controller. Further- 
more, B1PORT=0. 

7.1.2 Load, transposed 55 

In contrast to 7. 1. 1, the data are stored in Bl such that 
the data A(l,l), A(l,4) are stored in the first memory 
row, and the data A(2,l), ... A(2,4), are stored in the 
second memory row, etc. A dedicated sequencer, 
which is triggered by B1PORT (B1PORN=0), is re- 60 
quired for the changed sequence. 

7.2.1 Change in the memory halves 

The allocations of the two halves of the memory unit 
Bl, the one half to the input in 1 (loading), the other half 
to the multiplier chain (reading); are interchanged. The 65 
controller initializes the change in the memory halves 
via the control bit BIXCH. The signal BIXCH is passed 
from memory row to memory row in the memory unit 



Bl with a delay of one clock cycle in order to compen- 
sate for the delay time in the adder chain. It is possible 
to change the allocation of the memory halves only 
before a submatrix operation, that is to say at the earliest 
every 16 clock cycles. 

7.2.2 No change in the buffer halves 

The submatrix stored in Bl is required a plurality of 
times successively for calculations (learning process). 
Overwriting of these data is prevented if the change in 
the memory halves is omitted. The data which are 
stored on the multiplier side are maintained until the 
next change in the memory halves. BIXCH does not 
change its value. 

7.3 Read the data 

Reading takes place via a sequencer which is trig- 
gered via B1MULN. See 1.1 to 2.3 for the sequence for 
this purpose. 

8.1 LoadB2 

B2 has 16 memory cells. The data B(ij) are stored in 
the memory in series. The memory fields are selected by 
means of a sequencer, B2PORN being used for trigger- 
ing (see 7.1.1). 

8.2.1 Change in the memory halves 

The change in the memory halves is controlled by the 
controller via B2XCH. The change covers all the buffer 
halves at the same time. A change is possible with every 
new submatrix operation, at the earliest every 16 clock 
cycles. 

8.2.1 No change in the buffer halves 
If the value of B2XCH is maintained, the buffer 
halves are not changed. 

8.3.1 Read, not transposed 

The memory cells are read in the same sequence in 
the other memory half, in parallel with writing: Trigger 
pulse B2MULN. 

8.3.2 Read, transposed 

The sequence of the data read corresponds to the 
transposed matrix: Trigger pulse B2MULT. 

9.1 Fetch-and-add in B3, loop-side 

For the fetch-and-add operation, the data which are 
to be written in 9. 1 are read from the same memory cells 
of the memory B3 in which they will subsequently be 
stored again, 6 clock cycles in advance (trigger: 
B3LOOP). 

9.2.1 Change in the buffer halves 

The change is controlled by the controller via 
B3XCH (see 8.2.1). 

9.2.1 No change in the buffer halves 

The value of B3XCH is maintained ( see also 8.2.2). 

9.3.1 Read, port-side, not transposed 

The data located in B3 (port-side) are read in the 
sequence S (1,1), S (2,1), S (4,4) and are replaced by 
values which are simultaneously present at the input. 
The sequencer is triggered by B3PORN. 

9.3.2 Read, port-side, transposed 

The data located in B3 (port-side) are read in the 
sequence S (1,1), S (1,2), S (4,4) and are replaced by 
values which are present at the input of B3 at the same 
time. The sequencer is triggered by B3PORT. 

10.1 Load the data in B4 

The data in B4 (coefficients) are supplied to the cir- 
cuit arrangement via the interface in3 (OUT3). Four 
clock cycles are required for loading. The trigger pulse 
for the sequencer is B4PORN. 

10.2 Read the data 

The trigger pulse for reading is B4MULN. 

10.3 Change in the memory halves 
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Four further memory fields are provided for storage 
of a second set of coefficients. Selection takes place via 
B4XCH. 

11.1 Interface INOUT2 switched to input 

The input drivers of the interface INOUT2 are al- 5 
ways active. 

11.2 Interface INOUT2 switched to output. 

The changeover of the interface INOUT2 to output 
operation takes place via an external control bit 
PORT2C, in order to prevent external driver conflicts. 10 

The circuit arrangement described relates to a circuit 
module which, connected together with identical cir- 
cuit modules, produces a coprocessor for carrying out 
matrix operations. This coprocessor is supplied with the 
necessary control signals from a controller circuit. 15 

The internal construction of the controller and also of 
the memory units is described in the diploma thesis by 
U. Hachmann, "Controller Architektur fur einen 
Neuroemulator in Form eines systolischen Arrays" 
[Controller architecture for a neural emulator in the 20 
form of a systolic array], Department of Electrical En- 
gineering Components, Dortmund University, Dort- 
mund 1990. 

Specialist terms which are in normal use and are used 
herein for designating digital circuits are explained, for 25 
example, in Kai Hwang, "Computer Arithmetics Prin- 
ciples, Architecture and Design", John Wiley & Sons, 
1979. 

The invention is not limited to the particular details of 
the apparatus depicted and other modifications and 30 
applications are contemplated. Certain other changes 
may be made in the above described apparatus without 
departing from the true spirit and scope of the invention 
herein involved. It is intended, therefore, that the sub- 
ject matter in the above depiction shall be interpreted as 35 
illustrative and not in a limiting sense. 

What is claimed is: 

1. A circuit arrangement for calculating matrix opera- 
tions of submatrices, comprising: 

a matrix multiplier and a recursive accumulator, 40 

which is connected downstream from said matrix 

multiplier; 

first and second input terminals, and one input/output 
terminal; 

said matrix multiplier having the following units: 45 
dual-port first and second memory units for storage 
of input submatrices, the first memory unit being 
connected to the first input terminal and the second 
memory unit being connected to the input/output 
terminal, and the first memory unit having k inde- 50 
pendent memories, where k is a whole number 
greater than one; 

a systolic arrangement, which is connected to the first 
and second memory units, said systolic arrange- 
ment having k multipliers connected to k series- 55 
connected adders, respectively, for multiplication 
and addition/substraction of input submatrices, the 
first memory unit storing rows of an input subma- 
trix in k separate submemory units, the k multipli- 
ers connected via k supply lines, respectively, to 60 
the -k separate sub-memory units, respectively, 
means for controlling said k multipliers such that 
each multiplier, with the associated series-con- 
nected adder successively, and in a synchronized 
manner, carries out a multiplication of a matrix 65 
element from a corresponding row of the first 
memory unit with a matrix element of a row of a 
submatrix from the second memory element, the k 
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series-connected adders also connected to form an 
adder chain wherein results of said multiplication 
of matrix elements are accumulated to form prod- 
uct sums; 

a first final adder of a carry-select type connected at 
an end of said adder chain; and 

the recursive accumulator having the following units: 
third and fourth memory units, each of which 
being connected to the second input terminal, the 
third memory unit storing weightings and the 
fourth memory unit storing coefficients; 

a further multiplier for at least one of multiplying 
product sums in the first final adder by the coeffici- 
ents and squaring the product sums, said further 
multiplier connected to one of an output of the first 
final adder and an output of the fourth memory 
unit; 

a further adder, which is connected downstream 
from the further multiplier, for component-based 
addition of product sums with the weightings from 
the third memory; 

a second final adder, which is connected downstream 
from the further adder and is of a carry-select type; 
and . 

a recursive loop via the further adder, the second 
final adder or a comparator, a shifter and an over- 
flow-monitoring circuit, and via the third memory 
unit for component-based MIN/MAX comparison 
of resulting submatrices with a submatrix which is 
present at the second input terminal, and for over- 
flow regulation of the weightings which are stored 
in the third memory unit, said recursive loop 
formed by an output of the further adder being 
connected to one of an input of the second final 
adder and an input of the comparator, an input of 
the third memory unit being connected via the 
shifter and the overflow-monitoring circuit to one 
of an output of the second final adder and an output 
of the comparator, and an output of the third mem- 
ory unit being connected to an input of the further 
adder. 

2. A circuit arrangement for calculating matrix opera- 
tions of submatrices, comprising: 
a matrix multiplier and a recursive accumulator, 

which is connected downstream from said matrix 

multiplier; 

a first input terminal and a third input terminal; 

said matrix multiplier having dual port first and sec- 
ond memory units for storage of input submatrices, 
the first memory unit being connected to the first 
input terminal and the second memory unit being 
connected to the third input terminal, and the first 
memory unit having k independent memories, 
where k is a whole number greater than one; 

said matrix multiplier having a systolic arrangement, 
which is connected to the first and second memory 
units, said systolic arrangement having k multipli- 
ers connected to k series-connected adders, respec- 
tively, for multiplication and addition/subtraction 
of input submatrices, the first memory unit storing 
rows of an input submatrix; 

said matrix multiplier having means for controlling 
said k multipliers such that each of said k multipli- 
ers, with an associated series-connected adder, 
successively and in a synchronized manner, carries 
out a multiplication of a matrix element from a 
corresponding row of the first memory unit with a 
matrix element of a column of a submatrix from the 
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second memory element, the k series-connected 
adders being connected to form an adder chain 
wherein results of said multiplication of matrix 
elements are accumulated to form product sums; 
and 5 

said matrix multiplier having a first final adder con- 
nected to an end of said adder chain. 

3. A circuit arrangement for calculating matrix opera- 
tions of submatrices, comprising: 

a matrix multiplier and a recursive accumulator, 10 
which is connected downstream from said matrix 
multiplier, 

a first input terminal, a second input terminal, and a 
third input terminal; 

said matrix multiplier having dualport first and sec- 15 
ond memory units for storage of input submatrices, 
the first memory unit being connected to the first 
input terminal and the second memory unit being 
connected to the third input terminal, and the first 
memory unit having k independent memories, 
where k is a whole number greater than one; 

said matrix multiplier having a systolic arrangement, 
which is connected to the first and second memory 
units, said systolic arrangement having k multipli- 25 
ers connected to k series-connected adders, respec- 
tively, for multiplication and addition/subtraction 
of input submatrices, the first memory unit storing 
rows of an input submatrix; 

said matrix multiplier having means for controlling 30 
said k multipliers such that each of said k multipli- 
ers, with an associated series-connected adder, 
successively and in a synchronized manner, carries 
out a multiplication of a matrix element from a 
corresponding row of the first memory unit with a 35 
matrix element of a column of a submatrix from the 
second memory element, the k series-connected 
adders being connected to form an adder chain 
wherein results of said multiplication of matrix 
elements are accumulated to form product sums; 40 

said matrix multiplier having a first final adder con- 
nected to an end of said adder chain; 

the recursive accumulator having a third memory 
unit connected to the second input terminal, the 
third memory unit storing accumulation results; 45 

the recursive accumulator having a further adder, 
which is connected downstream from the first final 
adder of the matrix multiplier, for component- 



based addition of said product sums with contents 
of the third memory; 

the recursive accumulator having a second final ad- 
der, which is connected downstream from the fur- 
ther adder, results of the second final adder being 
stored in the third memory unit; and 

the recursive accumulator having a recursive loop via 
the further adder, the second final adder and via 
the third memory unit for component-based accu- 
mulation of resulting submatrices and means to 
perform accumulation with a submatrix which is 
present at the second input terminal, said recursive 
loop formed by connection of the further adder to 
the second final adder, by connection of said sec- 
ond final adder to said third memory unit, and by 
connection of said third memory unit connected 
said further adder. 

4. The circuit arrangement as claimed in claim 3, 
wherein said circuit arrangement further comprises; 

a fourth memory unit for storing coefficients, said 
fourth memory unit being connected to the second 
input terminal; 

a further multiplier for multiplying product sums in 
the first final adder by said coefficients, said further 
multiplier being connected to an output of the first 
final adder. 

5. The circuit arrangement as claimed in claim 3, 
wherein for a component-based MIN/MAX compari- 
son, said circuit arrangement further comprises: 

a fourth memory unit connected to the second input 
terminal, the fourth memory unit storing accumu- 
lation results; and 

a further multiplier for squaring accumulation results 
in the first final adder, said further multiplier being 
connected between an output of the first final 
adder and an input of the further adder, said further 
multiplier also connected to said fourth memory 
unit 

6. The circuit arrangement as claimed in claim 3, 
wherein said circuit arrangement further comprises: 

a fourth memory unit for storing coefficients, said 
fourth memory unit being connected to the second 
input terminal; 

a further multiplier for multiplying product sums in 
the first final adder by said coefficients, said further 
multiplier being connected to an output of the 
fourth memory unit. 
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ABSTRACT 



A matrix multiplier used in a picture transforming coder 
is provided for multiplying an input signal matrix by a 
coefficient matrix configured to comprise as many con- 
stant multipliers as absolute values of coefficients in the 
transform coefficient matrix for handling the signal 
matrix as common multiplicands, a plurality of selectors 
for selecting values necessary for computing elements 
of the matrix product from multiplication results output 
by the constant multipliers and a respective accumula- 
tor assigned to each of the selectors for accumulating 
the selected values to finally provide an element of the 
matrix product. As a result, since the processing can be 
done by merely performing as many fixed multiplica- 
tions as absolute values of the coefficients in the trans- 
form coefficient matrix, the multipliers can be imple- 
mented as a relatively simple combination of adders, 
allowing the entire size of a picture information trans- 
forming coder or the like to be reduced. 

6 Claims, 7 Drawing Sheets 
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MATRIX MULTIPLIER AND PICTURE 
TRANSFORMING CODER USING THE SAME 

BACKGROUND OF THE INVENTION 5 

(1) Field of the Invention 

The preaent invention relates to a matrix multiplier 
and a picture transforming coder using the matrix multi- 
plier. In particular, the present invention is applicable to 
a coder and decoder for audio signals and standstill as 10 
well as moving picture information and relates to a 
matrix multiplier utilized in a transforming/coding unit 
for matrix multiplication of a coefficient matrix used in 
the coder by a signal matrix and a picture transforming 
coder utilizing the matrix multiplier. 15 

(2) Description of the Prior Art 

In many cases, * technique referred to hereafter as 
DCT (Discrete Cosine Transform) is used in an appara- 
tus for efficiently coding a picture signal or the like. 
DCT is a kind of frequency transformation which is 20 
similar to Fourier transformation. For example, let 
input picture information be treated as a block of pixels 
comprising 8 horizontal rows and 8 vertical columns. 
Such a block of pixels is represented by a matrix [x] each 
element of which corresponds to a pixel. DCT results, 25 
or what is called DCT coefficients^ are represented by a 
matrix [X]. Using transform coefficients [C] and its 
transposed matrix t[c], the relations between the matri- 
ces [x] and [X] are given by Equation (1). In many cases, 
DCT and inverse DCT processings are carried out in 30 
accordance with the matrix relations given by Equation 
1. 



Equation (1): 

DCT[jr)=[qxDxt r q 



35 



Invent DCT Q= ^C] X [X ) X [Q 

where t[ ] denotes a transposed matrix and the matrix 
[C] is expressed by the following equation: 



[q. 



/Coo Co-i Cm Cm Qm Cm Cm Cd^ 

Ci-0 Cm Cm Q.3 Cm Ci.5 C w Ci.7 
Cj-o C 2 .\ C 2 .2 C 2 .3 C 24 C 2 . 5 C 2 .6 C 2 . 7 
Cyo C3.1 C 3 . 2 C3.3 C34 C3.3 C34 C3.7 

Cmj C4.J c+2 C4.3 Cm C4.3 C4.6 C4.7 
Cw Cyi C5.2 C3.3 C34 C3.3 C3* C3.7 
Cm Cm Cm Cm Cm C« Cw Q-7 

^C 7 <, Cm C 7 -2 C7.3 C74 C7.3 C7-6 C7.7y 
where 



= 1/V7 for / = 0 
A(0 - lfor/= I,... ,7 
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FIG. 1 shows a general configuration of a DCT pro- 
cessing apparatus for executing DCT and inverse DCT 
manipulations. In the case of the DCT processing, the 
picture information matrix [x] is used as an input matrix €5 
11. As for the inverse DCT processing, it is the DCT 
coefficient matrix [X] which is used as the input matrix 
11. The input matrix 11 is multiplied by transform coef- 



ficients 12 stored in a coefficient bank 104 by means of 
a matrix multiplier (A) denoted by reference numeral 
101. The results of the matrix multiplication are stored 
temporarily in a transpose buffer 102. The results stored 
in the transpose buffer 102 are further multiplied by 
transform coefficients 13 stored in a coefficient bank 
105 by means of a matrix multiplier (B) denoted by 
reference numeral 103. The results of the matrix multi- 
plication are obtained as an output matrix 14. In this 
case, the coefficient matrices 12 and 13 are the trans- 
form coefficients [C] and its transposed matrix t[C] 
respectively. 

FIG. 2 shows a typical configuration of the conven- 
tional matrix multiplier for performing the DCT and 
inverse DCT manipulations. In other words, FIG. 2 
shows the configuration of the matrix multiplier (A) or 
(B) denoted by reference numeral 101 or 103 in FIG. 1. 
Each multiplier 201 is used for multiplying an element 
on a row (or column) of an input matrix 21 by an ele- 
ment on a column (or row) of a transform coefficient 
matrix 23 coming from a transform coefficient genera- 
tor 205. The result of the multiplication is stored in an 
accumulator 203 through an adder 202. The multiplica- 
tion is repeated for all elements on each row (column) 
of the input matrix 21 and for all elements on each col- 
umn (row) of the transform coefficient matrix 23 and 
the result of each multiplication is added to the contents 
of the accumulator 203. As all the elements are multi- 
plied, the accumulators 203 are selected one after an- 
other by a selector 204 in order to output their contents 
sequentially as an output matrix 22. 

A typical configuration of the multiplier 201 is shown 
in FIG. 3. For the Ith bit of a multiplier 32, a multipli- 
cand 31 is arithmetic-left-shifted I times, where 1=0 to 
n. Note that arithmetic-left-shifting the multiplicand 31 
I times is equivalent to multiplying the multiplicand 31 
by the Ith power of 2. If the Ith bit of the multiplier 32 
is a T then the arithmetic-left-shifted result is selected 
by a selector 301 and added to an output by an adder 
302. If the Ith bit of the multiplier 32 is a '0* however the 
arithmetic-left-shifted result is not selected by the selec- 
tor 301 and, thus, not added to the output by tile adder 
302. 

SUMMARY OF THE INVENTION 

In the configuration of the matrix multiplier shown in 
FIG. 1, as many multipliers 201 as rows (or columns) of 
the transform coefficients are required. In addition, the 
transform coefficients which are used as multipliers in 
the multipliers 201 may vary from row to row (or col- 
umn to column) of elements of the input matrix. Ac- 
cordingly, a circuit having a large scale is required, in 
which case operations must be carried out at a high 
speed. In general, it is difficult to implement a multiplier 
with a small physical sized hardware. For example, in 
order to multiply an N-digit multiplicand by an M-digit 
multiplier using the typical multiplier shown in FIG. 3, 
it is necessary to provide N xM units of 1-bit full ad- 
ders. Let us take a video telephone conforming to the 
international standards as an example. In order to per- 
form real-time processing within the same time frame as 
the input signal with a sufficient accuracy, 8 units of 
matrix multipliers are required for simultaneously multi- 
plying an input signal matrix comprising 8 rows and 8- 
columns of elements with each element being a 16-bit 
binary number by a transform coefficient matrix com- 
prising 8 rows and 8 columns of elements with each 
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element being a 14-bit binary number. If the processing 
is to be done by the matrix multiplier shown in FIG. 2, 
1,792 (16X 14 x 8) units of adders are required. In order 
to carry out both the DCT and inverse DCT process- 5 
ings, the number of required adders is equal to this 
count further multiplied by 4. 

It is therefore an object of the present invention to 
provide a matrix multiplier for multiplying an input 
signal matrix by a transform coefficient matrix with a 10 
multiplying unit thereof implemented using a reduced 
number of circuit elements. 

It is another object of the present invention to pro- 
vide a low-cost picture coder used in video telephones 15 
and the like for performing DCT transformations. 

In order to achieve the above objects, a multiplying 
unit for multiplying an input signal matrix by a trans- 
form coefficient matrix is configured in accordance 
with the present invention to comprise: 20 

as many coefficient multiplying means as different 
absolute values of coefficients in said transform coeffici- 
ent matrix for multiplying said transform coefficients by 
input signal values with said input signal matrix used as 2 5 
common multiplicands for said coefficient multiplying 
means; 

a plurality of selectors for selecting values required 
for computing elements of a matrix product among 
multiplication results output by said coefficient multi- 30 
plying means; and 

an accumulating means associated with each of said 
selectors for accumulating multiplication results to be 
output as an element of said matrix product. 35 

In addition, the coefficient multiplying means are 
binary multipliers configured to share common process- 
ing means. The coefficient multiplying means are used 
to simultaneously multiply a signal value by 2 or more 
transform coefficients. Every common processing 40 
means described above is used for processing each bit 
pattern in the binary codes common to the transform 
coefficients. 

Effective embodiments according to the present in- 45 
vention include a coder for processing picture informa- 
tion and the like using the DCT transformation. None- 
theless, applications of the present invention are not 
limited to such a coder. 

The matrix multiplier according to the present inven- 50 
tion employs as many multipliers as different absolute 
values of elements Ci, j of a transform coefficient matrix 
[C]. In addition, the transform coefficients which are 
used as multipliers do not vary during a period of input- 55 
ting a row (or column) of a multiplicand matrix. Let a 
matrix given by Equation (2) below be the transform 
coefficient matrix [C] having elements Ci, j actually 
used for the transformation. In Equation 2, the absolute 
values of the elements Ci, j are represented by CI to C7. *° 
As such, the number of element values including nega- 
tive ones is 14. Accordingly, only 7 multipliers are re- 
quired for multiplying the seven coefficients, CI to C7, 
by an input signal value. Multiplication results required 65 
for computing elements of a matrix product are selected 
by the selectors and then accumulated in the accumulat- 
ing means. 



4 

Equation (2) 
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«-+«(+)■ 




(-ft-) 

Accordingly, for few coefficient absolute values of 
the transform coefficient matrix [C], the number of 
multipliers can be reduced, thus resulting in an appara- 
tus with a small size. 

In addition, the multipliers are configured to share 
common processing means for manipulating common 
bit patterns in the binary codes of the transform coeffici- 
ents. 

Each of the multipliers is used to multiply an input 
signal by a transform coefficient which does not vary. 
Accordingly, each multiplier can be implemented as a 
relatively simple combination of adders. By providing 
in advance a shared processing means for each bit pat- 
tern in the binary codes common to transform coeffici- 
ents, the amount of processing can be further reduced. 
In order to obtain results of the matrix multiplication, 
only necessary outputs of the multipliers are sequen- 
tially selected and then accumulated. 

The foregoing and other objects, advantages of oper- 
ation and novel features of the present invention will be 
understood from the following detailed description 
when read in connection with the accompanying draw- 
ings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a diagram showing a general configuration, 
of a DCT processing apparatus. 

FIG. 2 is a diagram showing the configuration of the 
conventional matrix multiplier. 
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FIG. 3 is a diagram showing the configuration of the lators 404-j, where i = 1 to 7 and j = 1 to 8, are used to 

conventional multiplier. carry out the matrix multiplication for the next input 

FIG. 4 is a diagram showing the configuration of an multiplicand matrix in the same way as the preceding 

embodiment implementing a matrix multiplier in accor- matrix multiplication. 

dance with the present invention. 5 FIG. 5 is a diagram showing the configuration of 

FIG. 5 is a diagram showing tile configuration of another embodiment implementing a matrix multiplier 

another embodiment implementing a matrix multiplier in accordance with the present invention. In the latter 

in accordance with the present invention. matrix multiplication of the DCT transformation and 

FIG. 6 is a diagram showing the configuration of the the former matrix multiplication of the inverse DCT 

conventional coefficient multipliers. 10 transformation, the number of coefficients used in the 

FIG. 7 is a diagram showing the operational configu- matrix multiplications is even smaller. The fact that the 

ration of an embodiment implementing coefficient mul- coefficient count is lower allows the configuration of 

tipliers in accordance with the present invention. the selectors 402 in the embodiment to be further simpli- 

FIG. 8 is a diagram showing an actual circuit config- Med. To be more specific, in addition to a reduced num- 

uration of the multipliers 401-1 to 401-7 employed in the 15 ber of selectors 402, each of the selectors 402 can also be 

matrix multiplier for manipulating the DCT processing made much simpler. The matrix expressed by Equation 

shown in FIG. 4 or 5. (2) clearly indicates that the selectors 402-1 and 402-5 

DESCRIPTION OF THF PRFFFKRFn dways 861601 P roducts of multiplying the transform 

D^CRIFHONOF TlffiFREFERRED coefficient C4. Accordingly, the selectors 402-1 and 

cniKJUiHitiri i » 20 402-5 and their select control lines 49 are not required. 

Referring to diagrams, embodiments according to the As for the selectors other than the 402-1 and the 402-5, 

present invention are described as follows. their number of inputs to be selected is reduced to 4 or 

FIG. 4 is a diagram showing the configuration of an 2, resulting in a simplified circuit of the selector control- 
embodiment implementing a matrix multiplier in accor- ler 405 and simplified select control lines 49. 
dance with the present invention. The embodiment is 25 FIG. 7 is a diagram showing the operational configu- 
configured to implement a circuit for executing the ration of the coefficient multipliers 401 used in the ma- 
DCT code transformation on picture information com- trix multiplier in accordance with the present invention, 
prising 8x8 pixels. The pixels each correspond to an A multiplier for multiplying coefficients is implemented 
element of a matrix. The elements of the matrix to be as a combination of adders such as the conventional one 
multiplied simultaneously are supplied as matrix multi- 30 shown in FIG. 6. As shown in FIG. 6, a procedure for 
plicand inputs 41 which are fed to a plurality of multi- computing a 26 times output 62 and a 45 times output 63 
pliers 401-1 to 401-7. The elements are each multiplied of a multiplicand input 61 is given as follows. The bi- 
by transform coefficients 42 to 48 having seven different nary format of the number 26 is known to be '1 1010' 
values represented by CI to C7 in Equation (2). Only which is the sum of the 16 times, the 8 times and the 2 
values required for computing the matrix product are 35 times. Accordingly, the 26 times output can be obtained 
selected from results obtained from the multipliers by adding the 16 times, the 8 times and the 2 times with 

401- 1 to 401-7 by selectors 402-1 to 402-8 in accordance adders 604 and 605. Similarly, the binary format of the 
with the matrix product equation for the matrix of number 45 is known to be * 101 101' which is the sum of 
Equation (2). Note that each of the selectors 402-1 to the 32 times, the 8 times, the 4 times and the 1 time. 

402- 8 is associated with an element of the matrix prod- 40 Accordingly, the 45 times output can be obtained by 
uct. The order in which the values are selected by one adding the 32 times, the 8 times, the 4 times and the 1 
of the selectors 402-1 to 402-8 is determined by a select time with adders 601, 602 and 603. In this case, since 
control line 49 which is programmed by a selector con- multiplying a number by the nth power of 2 is equiva- 
troller 405. It should be noted that for the sake of dia- lent to arithmetic-left-shifting the number, no special 
grammatical simplicity, the select control line 49 is 45 processing is required. Accordingly, five units of adders 
denoted by a single line in the figure. In the actual con- are used as is shown in the figure. 

figuration, however, each of the selectors 402-1 to 402-8 On the other hand, with a technique provided by the 

is provided with one independent select control line. present invention, a common bit pattern in the binary 

A selected value is then added to or subtracted from codes of the coefficients is identified and assigned to a 

contents 50 of an accumulator 404 by an adder/subtrac- 50 common processing means to be shared by the multipli- 

tor 403. The selection as to whether an addition or ers. For example, the binary format of the number 26 is 

subtraction is to be carried out depends upon whether •11010' whereas that of the number 45 is '101101' as 

the value of the transform coefficient multiplied is posi- described above. A bit pattern 4 10r is common to both 

tive or negative. the binary codes '11010* and '101 101*. Accordingly, 

As the matrix multiplication for an input matrix is 55 when computing 26 times and 45 times outputs at the 
completed, the accumulators 404 each contain an ele- same time, the 5 times calculation corresponding to the 
ment of the matrix product. The elements of the matrix common bit pattern *101* can be carried out by a com: 
product are sequentially fetched by a selector 406 and mon processing means provided in advance and to be 
transmitted as a matrix processing output 51. The selec- shared by the multipliers. The 26 times output is then 
tor 406 has the same configuration as the conventional 60 computed by adding the product of the 5 times and the 
•elector shown in FIG. 2 which the reader is already 2 times to the 16 times while the 45 times output is 
familiar with. In actuality, each of the accumulators calculated by adding the product of the 5 times and the 
404-1 to 404-8 is provided with a register for storing an 8 times to the 5 times. In this way, the processing can be 
element of the matrix product. At the completion of the simplified. To be more specific, a 4 times value 72 of a 
matrix multiplication, the contents of each register are 65 multiplicand input 71 is added to the multiplicand input- 
reset to zero. The contents of each register are output as 71 by an adder 701 to result in a 5 times value 73 which 
an output matrix 81 at fixed timing. In the mean time, is multiplied by 2 and then added to a 16 times value 74 
the multipliers 401-i, the selectors 402-j and the accumu- of the multiplicand input 71 by using an adder 702 to 
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give the 26 times output 75. At the same time, the 5 
times value 73 is multiplied by 8 to result in a 40 times 
value 76 which is added to the 5 times value 73 to give 
the 45 times output 77. In the embodiment shown in 
FIG. 7, the number of adders is thereby reduced to 5 
three. 

It is obvious that the technique adopted in the above 
embodiment is not limited to the computation of the 26 
and 45 times but can also be applied to all cases as well. 
In addition, any common bit pattern can always be 10 
assigned to a shared processing means as in the case 
with the common bit pattern *101\ 

FIG. 8 is a diagram showing an actual circuit config- 
uration of the multipliers 401-1 to 401-7 employed in the 
matrix multiplier for manipulating the DCT processing 15 
shown in FIG. 4 or 5. In this case, it is assumed that in 
the configuration shown in FIG. 8, coefficients to be 
multiplied by input data are alpha X CI to alpha XC7, 
where alpha = 4,096 X 2VT Accordingly, if the value of 
alpha changes, the number of times the arithmetic-left- 20 
shift operation to be performed also varies. Outputs of 
adders 501-1 to 501-7 shown in FIG. 8 correspond to 
the outputs of the multipliers 401-1 to 401-7 shown in 
FIG. 4 or 5. With the configuration of the multipliers 
shown in FIG. 8, the number of adders can be reduced 25 
to only 13. Therefore, in order to give a 16-bit process- 
ing accuracy, only 178 (—13x16) units of adders are 
needed. When compared to the 1,792 units of adders 
required in the multipliers adopting the conventional 
technique, the number 178 clearly means that the DCT 30 
transformation can be carried out to achieve the same 
processing accuracy using only one tenth as much of 
the conventional hardware. 

The above embodiments are described mainly for 
applications for manipulating 8x8 elements. It should 35 
be noted, however, that applications of the present 
invention are not limited to the processing of 8X8 ele- 
ments. The present invention is particularly effective 
when applied to a code transforming apparatus for han- 
dling picture information even though it is not re- 40 
stricted to such an application. 

In the matrix multiplier provided by the present in- 
vention, the multipliers for outputting elements of a 
matrix product have a configuration in which coeffici- 
ents are treated as multiplier inputs. In addition, the 45 
number of multipliers is independent of the number of 
elements in a row or column of the input multiplicand 
matrix and determined only by the number of absolute 
values of the coefficients in the transform coefficient 
matrix. Accordingly, the number of multipliers can be 50 
reduced. As a result, the invention allows a matrix mul- 
tiplier having a small size and a low cost to be imple- 
mented. In particular, the effects of the present inven- 
tion on the implementation and popularization of a 
transforming coder for picture information are substan- 55 
tial since the reduction in apparatus size and cost is a 
very important requirement. 

We claim: 

1. A picture transforming coder for efficiently coding 
a picture signal comprising: 60 

a first matrix multiplier for performing matrix pro- 
cessing on an input signal and a first transform 
coefficient matrix; and 

a second matrix multiplier for performing matrix 
processing on outputs of said first matrix multiplier 65 
and a second transform coefficient matrix which is 
the transposed matrix of said first transform coeffi- 
cient matrix, said second matrix multiplier output- 
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ting matrix processing results as a coded picture 
signal; 

wherein said first matrix multiplier includes: 

a plurality of constant multiplying means for receiv- 
ing said input signal in predetermined block units, 
handling each of said block units as a multiplicand 
matrix and multiplying each element of said multi- 
plicand matrix by each element of said first trans- 
form coefficient matrix, 

a plurality of selectors for selecting necessary output 
values from multiplication results output by said 
constant multiplying means, 

a control means for controlling said selectors by fol- 
lowing predetermined rules established in accor- 
dance with said first transform coefficient matrix, 
and 

a plurality of accumulating means for storing ele- 
ments of the matrix product by accumulating said 
necessary output values with each of said accumu- 
lating means assigned to one of said selectors. 

2. A picture transforming coder according to claim 1, 
wherein said input signal received in said block units is 
two-dimensional information on a plurality of pixels 
adjacent to each other; 

said first transform coefficient matrix is a DCT coeffi- 
cient matrix; and 

said second transform coefficient matrix is the trans- 
posed matrix of said DCT coefficient matrix. 

3. A picture transforming coder according to claim 1, 
wherein said constant multiplying means is configured 
to include a common numerical processing means 
shared by a first numerical processing means for multi- 
plying a first coefficient of said first transform coeffici- 
ent matrix by a matrix element of said multiplicand 
matrix, a second numerical processing means for multi- 
plying a second coefficient of said first transform coeffi- 
cient matrix by said matrix element and any other nu- 
merical processing means for multiplying other coeffici- 
ents of said first transform coefficient matrix by said 
matrix element so that said constant multiplying means 
are capable of multiplying two or more coefficients of 
said first transform coefficient matrix by an element of 
said multiplicand matrix; and 

said common numerical processing means is used for 
handling a bit pattern common to the binary codes 
of said first coefficient, said second coefficient and 
said other coefficients being multiplied. 

4. A matrix multiplier applicable to an apparatus for 
performing predetermined coded on an input signal 
comprising: 

a plurality of constant multiplying means, connected 
to receive said input signal in predetermined block 
units, for each handling a respective one of said 
block units as a multiplicand matrix and for multi- 
plying each element of said multiplicand matrix by 
a respective of a coefficient matrix; 

a plurality of selectors each for selecting necessary 
output values from multiplication results output by 
said plurality of constant multiplying means; 

control means for controlling said selectors by fol- 
lowing predetermined rules established in accor- 
dance with said coefficient matrix; and 

a plurality of accumulating means each storing ele- 
ments of a matrix product by accumulating respec- 
tive ones of said necessary output values received- 
from a respective one of said selectors. 

5. A matrix multiplier according to claim 4, wherein 
said input signal received in said block units is two-di- 
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mensional information on a plurality of pixels adjacent 
to each other; and 
said coefficient matrix is a DCT coefficient matrix of 
the transposed matrix of said DCT coefficient ma- 
trix. 5 
6. A matrix multiplier applicable to an apparatus for 
performing predetermined coding on an input signal 
comprising: 

a plurality of constant multiplying means for receiv- 
ing said input signal in predetermined block units, 10 
handling each of said block units as a multiplicand 
matrix and multiplying each element of said multi- 
plicand matrix by each of a coefficient matrix; 

a plurality of selectors for selecting necessary output 
values from multiplication results output by said 15 
constant multiplying means, a control means for 
controlling said selectors by following predeter- 
mined rules established in accordance with said 
coefficient matrix; and 

a plurality of accumulating means for storing ele- 20 
ments of the matrix product by accumulating said 
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necessary output values with each of said accumu- 
lating means assigned to one of said selectors; 

wherein said constant multiplying means is config- 
ured to include a common numerical processing 
means shared by a first numerical processing means 
for multiplying a first coefficient of said coefficient 
matrix by a matrix element of said multiplicand 
matrix, a second numerical processing means for 
multiplying a second coefficient of said coefficient 
matrix by said matrix element and any other nu- 
merical processing means for multiplying other 
coefficients of said coefficient matrix by said matrix 
element so that said constant multiplying means are 
capable of multiplying two or more coefficients of 
said coefficient matrix by an element of said multi- 
plicand matrix; and wherein 

said common numerical processing means is used for 
handling a bit pattern common to the binary codes 
of said first coefficient, said second coefficient and 
said other coefficients being multiplied. 
***** 
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